Analysis of Environmental Exposures, Diet, and Metabolomics on Child BMI Z-Score

Abstract

Background: Childhood obesity is a growing public health concern with significant implications for physical and mental health. Understanding the multifactorial influences on childhood Body Mass Index (BMI) is crucial for developing effective interventions and policies.

Objective: This study aims to predict childhood BMI Z-scores by integrating comprehensive data on postnatal diet, chemical exposures, and serum metabolomics, using advanced statistical models.

Methods: Data from the HELIX study, including 1301 mother-child pairs aged 6-11 years, were analyzed. Various modeling approaches, including LASSO, ridge, elastic net, decision tree, random forest, and gradient boosting machine (GBM), were employed to predict BMI Z-scores. Models were evaluated based on their Root Mean Squared Error (RMSE) and the significance of the predictors identified.

Results: The results demonstrate that models incorporating a comprehensive set of variables (diet, chemicals, metabolomics, and covariates) consistently outperformed those with fewer variables. Specifically, the Group Lasso model achieved the lowest RMSE of 0.875 with 10-fold cross-validation, followed closely by Group Ridge (RMSE: 0.885) and Group Elastic Net (RMSE: 0.885). Ensemble methods such as Random Forest and GBM also showed robust performance, with GBM yielding an RMSE of 0.954.

Key variables identified across the models include demographic factors such as age and sex, dietary factors like breastfeeding duration and intake of bakery products, and chemical exposures including PCB-170 (hs_pcb170_cadj_Log2) and PBDE-153 (hs_pbde153_cadj_Log2). The addition of metabolomic data significantly enhanced the predictive accuracy, highlighting its potential utility in obesity research.

Conclusion: The findings highlight the critical role of chemical exposures and metabolomic profiles in determining childhood BMI. Incorporating diverse data sources significantly enhances model performance, providing a holistic understanding of the determinants of childhood obesity. Future research should focus on improving data quality, exploring additional covariates, and assessing the generalizability of the models to different populations.

Introduction

Background

Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, can interfere with hormonal signaling and are linked to detrimental effects on body weight and metabolic health in children. This hormonal interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.

A longitudinal study on Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). The findings revealed that higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development. These chemicals can mimic or disrupt hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.

The study by Harley et al. (2013) investigates the association between prenatal and postnatal bisphenol A (BPA) exposure and various body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to decreased BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with increased adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.

The 2022 study by Uldbjerg et al. explored the effects of combined exposures to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to mixtures of chemicals rather than individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can be additive (where the effects simply sum) or even synergistic (where the combined effect is greater than the sum of the separate effects). Such interactions can significantly amplify the risk factors associated with obesity and metabolic disorders in children. Dose-response analyses indicated that even low-level exposure to multiple EDCs could result in significant health impacts through their combined effects.

These studies collectively illustrate the critical role of environmental exposures in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.

Hypothesis

How are postnatal environmental exposures, specifically those found in household products and dietary intake, along with specific serum metabolomics profiles, associated with the BMI Z-score of children aged 6-11 years? Specifically, do higher concentrations of certain serum metabolites, indicative of exposure to chemical classes or metals, correlate with variations in BMI Z-score when controlling for age and other relevant covariates? Additionally, can these metabolites serve as biomarkers for the risk of developing obesity in children?

Methods

Data Description

This study utilizes data from a subcohort of 1301 mother-child pairs in the HELIX study, aged 6-11 years, for whom complete exposure and outcome data were available. Exposure data included detailed dietary records after pregnancy and concentrations of various chemicals in child blood samples. The dataset contains both categorical and numerical variables, covering demographic details and biochemical measurements. It allows for robust statistical analysis to identify potential associations between chemical exposure and changes in BMI Z-scores, accounting for confounding factors such as age, sex, and socioeconomic status. There are no missing data among these variables, so no imputation was needed. Child BMI Z-scores were calculated based on WHO growth standards.

For the metabolomic data, however, children with missing serum metabolomics values were excluded because too many entries were missing. Exclusion was preferred over mean or median imputation, since imputation would have assigned identical metabolite values to every child lacking serum metabolomic information.
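This exclusion step can be sketched as follows. This is a minimal illustration, not the exact analysis code; the object name `metabol_serum` (one row per child of serum metabolite columns) is an assumption about how the HELIX metabolomic data are stored.

```r
# Keep only children with a complete serum metabolomics profile,
# rather than imputing missing profiles with a mean or median
# (which would give all imputed children identical metabolite values).
# `metabol_serum` is assumed to hold one row per child.
complete_rows <- complete.cases(metabol_serum)
metabol_complete <- metabol_serum[complete_rows, ]
```

`complete.cases()` returns TRUE only for rows with no missing values, so the retained subset needs no imputation at all.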

# packages used throughout the analysis
library(dplyr)        # filter(), bind_rows(), %>%
library(tidyr)        # pivot_longer()
library(ggplot2)      # plots
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
library(summarytools) # dfSummary()

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")

# postnatal chemical and lifestyle exposures, excluding allergens
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

#specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
variable_name domain family subfamily period location period_postnatal description var_type transformation labels labelsshort
h_bfdur_Ter h_bfdur_Ter Lifestyles Lifestyle Diet Postnatal NA NA Breastfeeding duration (weeks) factor Tertiles Breastfeeding Breastfeeding
hs_bakery_prod_Ter hs_bakery_prod_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bakery products (hs_cookies + hs_pastries) factor Tertiles Bakery prod BakeProd
hs_beverages_Ter hs_beverages_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: beverages (hs_dietsoda+hs_soda) factor Tertiles Soda Soda
hs_break_cer_Ter hs_break_cer_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: breakfast cereal (hs_sugarcer+hs_othcer) factor Tertiles BF cereals BFcereals
hs_caff_drink_Ter hs_caff_drink_Ter Lifestyles Lifestyle Diet Postnatal NA NA Drinks a caffeinated or energy drink (eg coca-cola, diet-coke, redbull) factor Tertiles Caffeine Caffeine
hs_dairy_Ter hs_dairy_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) factor Tertiles Dairy Dairy
hs_fastfood_Ter hs_fastfood_Ter Lifestyles Lifestyle Diet Postnatal NA NA Visits a fast food restaurant/take away factor Tertiles Fastfood Fastfood
hs_KIDMED_None hs_KIDMED_None Lifestyles Lifestyle Diet Postnatal NA NA Sum of KIDMED indices, without index9 numeric None KIDMED KIDMED
hs_mvpa_prd_alt_None hs_mvpa_prd_alt_None Lifestyles Lifestyle Physical activity Postnatal NA NA Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) numeric None PA PA
hs_org_food_Ter hs_org_food_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats organic food factor Tertiles Organicfood Organicfood
hs_proc_meat_Ter hs_proc_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: processed meat (hs_coldmeat+hs_ham) factor Tertiles Processed meat ProcMeat
hs_readymade_Ter hs_readymade_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats a ready-made supermarket meal factor Tertiles Ready made food ReadyFood
hs_sd_wk_None hs_sd_wk_None Lifestyles Lifestyle Physical activity Postnatal NA NA sedentary behaviour (min/day) numeric None Sedentary Sedentary
hs_total_bread_Ter hs_total_bread_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bread (hs_darkbread+hs_whbread) factor Tertiles Bread Bread
hs_total_cereal_Ter hs_total_cereal_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) factor Tertiles Cereals Cereals
hs_total_fish_Ter hs_total_fish_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) factor Tertiles Fish Fish
hs_total_fruits_Ter hs_total_fruits_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) factor Tertiles Fruits Fruits
hs_total_lipids_Ter hs_total_lipids_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: Added fat factor Tertiles Diet fat Diet fat
hs_total_meat_Ter hs_total_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) factor Tertiles Meat Meat
hs_total_potatoes_Ter hs_total_potatoes_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: potatoes (hs_frenchfries+hs_potatoes) factor Tertiles Potatoes Potatoes
hs_total_sweets_Ter hs_total_sweets_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: sweets (hs_choco + hs_sweets + hs_sugar) factor Tertiles Sweets Sweets
hs_total_veg_Ter hs_total_veg_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: vegetables (hs_cookveg+hs_rawveg) factor Tertiles Vegetables Vegetables
hs_total_yog_Ter hs_total_yog_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: yogurt (hs_yogurt+hs_probiotic) factor Tertiles Yogurt Yogurt
hs_dif_hours_total_None hs_dif_hours_total_None Lifestyles Lifestyle Sleep Postnatal NA NA Total hours of sleep (mean weekdays and night) numeric None Sleep Sleep
hs_as_c_Log2 hs_as_c_Log2 Chemicals Metals As Postnatal NA NA Arsenic (As) in child numeric Logarithm base 2 As As
hs_cd_c_Log2 hs_cd_c_Log2 Chemicals Metals Cd Postnatal NA NA Cadmium (Cd) in child numeric Logarithm base 2 Cd Cd
hs_co_c_Log2 hs_co_c_Log2 Chemicals Metals Co Postnatal NA NA Cobalt (Co) in child numeric Logarithm base 2 Co Co
hs_cs_c_Log2 hs_cs_c_Log2 Chemicals Metals Cs Postnatal NA NA Caesium (Cs) in child numeric Logarithm base 2 Cs Cs
hs_cu_c_Log2 hs_cu_c_Log2 Chemicals Metals Cu Postnatal NA NA Copper (Cu) in child numeric Logarithm base 2 Cu Cu
hs_hg_c_Log2 hs_hg_c_Log2 Chemicals Metals Hg Postnatal NA NA Mercury (Hg) in child numeric Logarithm base 2 Hg Hg
hs_mn_c_Log2 hs_mn_c_Log2 Chemicals Metals Mn Postnatal NA NA Manganese (Mn) in child numeric Logarithm base 2 Mn Mn
hs_mo_c_Log2 hs_mo_c_Log2 Chemicals Metals Mo Postnatal NA NA Molybdenum (Mo) in child numeric Logarithm base 2 Mo Mo
hs_pb_c_Log2 hs_pb_c_Log2 Chemicals Metals Pb Postnatal NA NA Lead (Pb) in child numeric Logarithm base 2 Pb Pb
hs_tl_cdich_None hs_tl_cdich_None Chemicals Metals Tl Postnatal NA NA Dichotomous variable of thallium (Tl) in child factor None Tl Tl
hs_dde_cadj_Log2 hs_dde_cadj_Log2 Chemicals Organochlorines DDE Postnatal NA NA Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids numeric Logarithm base 2 DDE DDE
hs_ddt_cadj_Log2 hs_ddt_cadj_Log2 Chemicals Organochlorines DDT Postnatal NA NA Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids numeric Logarithm base 2 DDT DDT
hs_hcb_cadj_Log2 hs_hcb_cadj_Log2 Chemicals Organochlorines HCB Postnatal NA NA Hexachlorobenzene (HCB) in child adjusted for lipids numeric Logarithm base 2 HCB HCB
hs_pcb118_cadj_Log2 hs_pcb118_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids numeric Logarithm base 2 PCB 118 PCB118
hs_pcb138_cadj_Log2 hs_pcb138_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids numeric Logarithm base 2 PCB 138 PCB138
hs_pcb153_cadj_Log2 hs_pcb153_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids numeric Logarithm base 2 PCB 153 PCB153
hs_pcb170_cadj_Log2 hs_pcb170_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids numeric Logarithm base 2 PCB 170 PCB170
hs_pcb180_cadj_Log2 hs_pcb180_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids numeric Logarithm base 2 PCB 180 PCB180
hs_sumPCBs5_cadj_Log2 hs_sumPCBs5_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Sum of PCBs in child adjusted for lipids (4 cohorts) numeric Logarithm base 2 PCBs SumPCB
hs_dep_cadj_Log2 hs_dep_cadj_Log2 Chemicals Organophosphate pesticides DEP Postnatal NA NA Diethyl phosphate (DEP) in child adjusted for creatinine numeric Logarithm base 2 DEP DEP
hs_detp_cadj_Log2 hs_detp_cadj_Log2 Chemicals Organophosphate pesticides DETP Postnatal NA NA Diethyl thiophosphate (DETP) in child adjusted for creatinine numeric Logarithm base 2 DETP DETP
hs_dmdtp_cdich_None hs_dmdtp_cdich_None Chemicals Organophosphate pesticides DMDTP Postnatal NA NA Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child factor None DMDTP DMDTP
hs_dmp_cadj_Log2 hs_dmp_cadj_Log2 Chemicals Organophosphate pesticides DMP Postnatal NA NA Dimethyl phosphate (DMP) in child adjusted for creatinine numeric Logarithm base 2 DMP DMP
hs_dmtp_cadj_Log2 hs_dmtp_cadj_Log2 Chemicals Organophosphate pesticides DMTP Postnatal NA NA Dimethyl thiophosphate (DMTP) in child adjusted for creatinine numeric Logarithm base 2 DMDTP DMTP
hs_pbde153_cadj_Log2 hs_pbde153_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE153 Postnatal NA NA Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids numeric Logarithm base 2 PBDE 153 PBDE153
hs_pbde47_cadj_Log2 hs_pbde47_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE47 Postnatal NA NA Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids numeric Logarithm base 2 PBDE 47 PBDE47
hs_pfhxs_c_Log2 hs_pfhxs_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFHXS Postnatal NA NA Perfluorohexane sulfonate (PFHXS) in child numeric Logarithm base 2 PFHXS PFHXS
hs_pfna_c_Log2 hs_pfna_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFNA Postnatal NA NA Perfluorononanoate (PFNA) in child numeric Logarithm base 2 PFNA PFNA
hs_pfoa_c_Log2 hs_pfoa_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOA Postnatal NA NA Perfluorooctanoate (PFOA) in child numeric Logarithm base 2 PFOA PFOA
hs_pfos_c_Log2 hs_pfos_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOS Postnatal NA NA Perfluorooctane sulfonate (PFOS) in child numeric Logarithm base 2 PFOS PFOS
hs_pfunda_c_Log2 hs_pfunda_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFUNDA Postnatal NA NA Perfluoroundecanoate (PFUNDA) in child numeric Logarithm base 2 PFUNDA PFUNDA
hs_bpa_cadj_Log2 hs_bpa_cadj_Log2 Chemicals Phenols BPA Postnatal NA NA Bisphenol A (BPA) in child adjusted for creatinine numeric Logarithm base 2 BPA BPA
hs_bupa_cadj_Log2 hs_bupa_cadj_Log2 Chemicals Phenols BUPA Postnatal NA NA N-Butyl paraben (BUPA) in child adjusted for creatinine numeric Logarithm base 2 BUPA BUPA
hs_etpa_cadj_Log2 hs_etpa_cadj_Log2 Chemicals Phenols ETPA Postnatal NA NA Ethyl paraben (ETPA) in child adjusted for creatinine numeric Logarithm base 2 ETPA ETPA
hs_mepa_cadj_Log2 hs_mepa_cadj_Log2 Chemicals Phenols MEPA Postnatal NA NA Methyl paraben (MEPA) in child adjusted for creatinine numeric Logarithm base 2 MEPA MEPA
hs_oxbe_cadj_Log2 hs_oxbe_cadj_Log2 Chemicals Phenols OXBE Postnatal NA NA Oxybenzone (OXBE) in child adjusted for creatinine numeric Logarithm base 2 OXBE OXBE
hs_prpa_cadj_Log2 hs_prpa_cadj_Log2 Chemicals Phenols PRPA Postnatal NA NA Propyl paraben (PRPA) in child adjusted for creatinine numeric Logarithm base 2 PRPA PRPA
hs_trcs_cadj_Log2 hs_trcs_cadj_Log2 Chemicals Phenols TRCS Postnatal NA NA Triclosan (TRCS) in child adjusted for creatinine numeric Logarithm base 2 TRCS TRCS
hs_mbzp_cadj_Log2 hs_mbzp_cadj_Log2 Chemicals Phthalates MBZP Postnatal NA NA Mono benzyl phthalate (MBzP) in child adjusted for creatinine numeric Logarithm base 2 MBZP MBZP
hs_mecpp_cadj_Log2 hs_mecpp_cadj_Log2 Chemicals Phthalates MECPP Postnatal NA NA Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine numeric Logarithm base 2 MECPP MECPP
hs_mehhp_cadj_Log2 hs_mehhp_cadj_Log2 Chemicals Phthalates MEHHP Postnatal NA NA Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine numeric Logarithm base 2 MEHHP MEHHP
hs_mehp_cadj_Log2 hs_mehp_cadj_Log2 Chemicals Phthalates MEHP Postnatal NA NA Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine numeric Logarithm base 2 MEHP MEHP
hs_meohp_cadj_Log2 hs_meohp_cadj_Log2 Chemicals Phthalates MEOHP Postnatal NA NA Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine numeric Logarithm base 2 MEOHP MEOHP
hs_mep_cadj_Log2 hs_mep_cadj_Log2 Chemicals Phthalates MEP Postnatal NA NA Monoethyl phthalate (MEP) in child adjusted for creatinine numeric Logarithm base 2 MEP MEP
hs_mibp_cadj_Log2 hs_mibp_cadj_Log2 Chemicals Phthalates MIBP Postnatal NA NA Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine numeric Logarithm base 2 MIBP MIBP
hs_mnbp_cadj_Log2 hs_mnbp_cadj_Log2 Chemicals Phthalates MNBP Postnatal NA NA Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine numeric Logarithm base 2 MNBP MNBP
hs_ohminp_cadj_Log2 hs_ohminp_cadj_Log2 Chemicals Phthalates OHMiNP Postnatal NA NA Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine numeric Logarithm base 2 OHMiNP OHMiNP
hs_oxominp_cadj_Log2 hs_oxominp_cadj_Log2 Chemicals Phthalates OXOMINP Postnatal NA NA Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine numeric Logarithm base 2 OXOMINP OXOMINP
hs_sumDEHP_cadj_Log2 hs_sumDEHP_cadj_Log2 Chemicals Phthalates DEHP Postnatal NA NA Sum of DEHP metabolites (µg/g) in child adjusted for creatinine numeric Logarithm base 2 DEHP SumDEHP
FAS_cat_None FAS_cat_None Chemicals Social and economic capital Economic capital Postnatal NA NA Family affluence score factor None Family affluence FamAfl
hs_contactfam_3cat_num_None hs_contactfam_3cat_num_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: family friends factor None Social contact SocCont
hs_hm_pers_None hs_hm_pers_None Chemicals Social and economic capital Social capital Postnatal NA NA How many people live in your home? numeric None House crowding HouseCrow
hs_participation_3cat_None hs_participation_3cat_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: structural factor None Social participation SocPartic
hs_cotinine_cdich_None hs_cotinine_cdich_None Chemicals Tobacco Smoke Cotinine Postnatal NA NA Dichotomous variable of cotinine in child factor None Cotinine Cotinine
hs_globalexp2_None hs_globalexp2_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Global exposure of the child to ETS (2 categories) factor None ETS ETS
hs_smk_parents_None hs_smk_parents_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Tobacco Smoke status of parents (both) factor None Smoking_parents SmokPar
e3_sex_None e3_sex_None Covariates Covariates Child covariate Pregnancy NA NA Child sex (female / male) factor None Child sex Sex
e3_yearbir_None e3_yearbir_None Covariates Covariates Child covariate Pregnancy NA NA Year of birth (2003 to 2009) factor None Year of birth YearBirth
hs_child_age_None hs_child_age_None Covariates Covariates Child covariate Postnatal NA NA Child age at examination (years) numeric None Child age cAge
hs_zbmi_who hs_zbmi_who Phenotype Phenotype Outcome at 6-11 years old Postnatal NA NA Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age numeric None Body mass index z-score zBMI

Data Summary for Exposures, Covariates, and Outcome

Data Summary Exposures: Dietary

Dietary intake variables were categorized into tertiles to assess the impact of different levels of dietary intake on BMI Z-scores. The variables included:

  • h_bfdur_Ter: Duration of breastfeeding

  • hs_bakery_prod_Ter: Intake of bakery products

  • hs_break_cer_Ter: Intake of breakfast cereals

  • hs_dairy_Ter: Intake of dairy products

  • hs_fastfood_Ter: Intake of fast food

  • hs_org_food_Ter: Intake of organic food

  • hs_proc_meat_Ter: Intake of processed meat

  • hs_total_fish_Ter: Intake of total fish

  • hs_total_fruits_Ter: Intake of total fruits

  • hs_total_lipids_Ter: Intake of total lipids

  • hs_total_sweets_Ter: Intake of total sweets

  • hs_total_veg_Ter: Intake of total vegetables

# specific lifestyle exposures
dietary_exposures <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

dietary_exposome <- dplyr::select(exposome, all_of(dietary_exposures))
summarytools::view(dfSummary(dietary_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
All twelve dietary variables are tertile-coded factors with no missing values (0, 0.0%). Counts (% of valid) per tertile:

1. h_bfdur_Ter [factor]: (0,10.8] 506 (38.9%); (10.8,34.9] 270 (20.8%); (34.9,Inf] 525 (40.4%)
2. hs_bakery_prod_Ter [factor]: (0,2] 345 (26.5%); (2,6] 423 (32.5%); (6,Inf] 533 (41.0%)
3. hs_break_cer_Ter [factor]: (0,1.1] 291 (22.4%); (1.1,5.5] 521 (40.0%); (5.5,Inf] 489 (37.6%)
4. hs_dairy_Ter [factor]: (0,14.6] 359 (27.6%); (14.6,25.6] 465 (35.7%); (25.6,Inf] 477 (36.7%)
5. hs_fastfood_Ter [factor]: (0,0.132] 143 (11.0%); (0.132,0.5] 603 (46.3%); (0.5,Inf] 555 (42.7%)
6. hs_org_food_Ter [factor]: (0,0.132] 429 (33.0%); (0.132,1] 396 (30.4%); (1,Inf] 476 (36.6%)
7. hs_proc_meat_Ter [factor]: (0,1.5] 366 (28.1%); (1.5,4] 471 (36.2%); (4,Inf] 464 (35.7%)
8. hs_total_fish_Ter [factor]: (0,1.5] 389 (29.9%); (1.5,3] 454 (34.9%); (3,Inf] 458 (35.2%)
9. hs_total_fruits_Ter [factor]: (0,7] 413 (31.7%); (7,14.1] 407 (31.3%); (14.1,Inf] 481 (37.0%)
10. hs_total_lipids_Ter [factor]: (0,3] 397 (30.5%); (3,7] 403 (31.0%); (7,Inf] 501 (38.5%)
11. hs_total_sweets_Ter [factor]: (0,4.1] 344 (26.4%); (4.1,8.5] 516 (39.7%); (8.5,Inf] 441 (33.9%)
12. hs_total_veg_Ter [factor]: (0,6] 404 (31.1%); (6,8.5] 314 (24.1%); (8.5,Inf] 583 (44.8%)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-30

categorical_diet <- dietary_exposome %>% 
  dplyr::select(where(is.factor))

categorical_diet_long <- pivot_longer(
  categorical_diet,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_diet_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_diet_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})
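The per-variable loop above can also be collapsed into a single faceted figure, which makes the tertile distributions easier to compare side by side. A sketch using the same long-format data (`categorical_diet_long` from the chunk above); the facet layout is a suggested alternative, not part of the original analysis:

```r
# One bar chart per dietary tertile variable, arranged in a grid;
# free x-scales because each variable has its own tertile cut points
ggplot(categorical_diet_long, aes(x = value, fill = value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_x") +
  labs(x = NULL, y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
```

`facet_wrap()` keeps a common count axis across panels, so differences in how evenly each tertile variable splits are visible at a glance.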

Breastfeeding Duration: The majority of observations are in the highest duration category, suggesting longer breastfeeding periods are common.

Bakery Products: Shows a relatively even distribution across the three categories, indicating varied consumption levels of bakery products among participants.

Breakfast Cereal: The highest category of cereal consumption is the most common, suggesting a preference for or greater consumption of cereals.

Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.

Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.

Organic Food: Most participants either consume a lot of or no organic food, with fewer in the middle range.

Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.

Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.

Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.

Added Fats: More participants consume added fats at the lowest and highest levels, with fewer in the middle.

Sweets: High consumption of sweets is the most common, indicating a preference for or higher access to sugary foods.

Vegetables: Most participants consume a high amount of vegetables.

Data Summary Exposures: Chemicals

This study included a comprehensive assessment of various chemical exposures to understand their impact on childhood BMI Z-scores. The specific chemical exposures selected for analysis were as follows:

  • hs_cd_c_Log2: Cadmium concentration

  • hs_co_c_Log2: Cobalt concentration

  • hs_cs_c_Log2: Cesium concentration

  • hs_cu_c_Log2: Copper concentration

  • hs_hg_c_Log2: Mercury concentration

  • hs_mo_c_Log2: Molybdenum concentration

  • hs_pb_c_Log2: Lead concentration

  • hs_dde_cadj_Log2: DDE (a breakdown product of DDT) concentration, adjusted

  • hs_pcb153_cadj_Log2: PCB 153 concentration, adjusted

  • hs_pcb170_cadj_Log2: PCB 170 concentration, adjusted

  • hs_dep_cadj_Log2: DEP (diethyl phosphate, an organophosphate pesticide metabolite) concentration, adjusted

  • hs_pbde153_cadj_Log2: PBDE 153 (Polybrominated diphenyl ether) concentration, adjusted

  • hs_pfhxs_c_Log2: PFHxS (Perfluorohexane sulfonic acid) concentration

  • hs_pfoa_c_Log2: PFOA (Perfluorooctanoic acid) concentration

  • hs_pfos_c_Log2: PFOS (Perfluorooctane sulfonic acid) concentration

  • hs_prpa_cadj_Log2: PRPA (propyl paraben) concentration, adjusted

  • hs_mbzp_cadj_Log2: MBzP (Mono-benzyl phthalate) concentration, adjusted

  • hs_mibp_cadj_Log2: MiBP (Mono-isobutyl phthalate) concentration, adjusted

  • hs_mnbp_cadj_Log2: MnBP (Mono-n-butyl phthalate) concentration, adjusted

# specific chemical exposures
chemical_exposures <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
All nineteen chemical variables are numeric (log2-transformed) with no missing values (0, 0.0%). Summary statistics: mean (sd); min ≤ med ≤ max; IQR (CV); number of distinct values:

1. hs_cd_c_Log2: -4 (1); -10.4 ≤ -3.8 ≤ 0.8; IQR 1 (CV -0.3); 695 distinct values
2. hs_co_c_Log2: -2.3 (0.6); -5.5 ≤ -2.4 ≤ 1.4; IQR 0.7 (CV -0.3); 317 distinct values
3. hs_cs_c_Log2: 0.4 (0.6); -1.5 ≤ 0.5 ≤ 3.1; IQR 0.8 (CV 1.3); 369 distinct values
4. hs_cu_c_Log2: 9.8 (0.2); 9.1 ≤ 9.8 ≤ 12.1; IQR 0.3 (CV 0); 345 distinct values
5. hs_hg_c_Log2: -0.3 (1.7); -10.9 ≤ -0.2 ≤ 3.7; IQR 2.1 (CV -5.6); 698 distinct values
6. hs_mo_c_Log2: -0.3 (0.9); -9.2 ≤ -0.4 ≤ 5.1; IQR 0.8 (CV -2.9); 593 distinct values
7. hs_pb_c_Log2: 3.1 (0.6); 1.1 ≤ 3.1 ≤ 7.7; IQR 0.8 (CV 0.2); 529 distinct values
8. hs_dde_cadj_Log2: 4.7 (1.5); 1.2 ≤ 4.5 ≤ 11.1; IQR 1.9 (CV 0.3); 1050 distinct values
9. hs_pcb153_cadj_Log2: 3.6 (0.9); 1.2 ≤ 3.5 ≤ 7.8; IQR 1.4 (CV 0.3); 1047 distinct values
10. hs_pcb170_cadj_Log2: -0.3 (3); -16.8 ≤ 0.3 ≤ 4.8; IQR 2.2 (CV -9.8); 1039 distinct values
11. hs_dep_cadj_Log2: 0.2 (3.2); -12.6 ≤ 0.9 ≤ 9.4; IQR 3.3 (CV 20); 1045 distinct values
12. hs_pbde153_cadj_Log2: -4.5 (3.8); -17.6 ≤ -2.6 ≤ 4; IQR 6.7 (CV -0.8); 1036 distinct values
13. hs_pfhxs_c_Log2: -1.6 (1.3); -8.9 ≤ -1.4 ≤ 4.8; IQR 1.7 (CV -0.8); 1061 distinct values
14. hs_pfoa_c_Log2: 0.6 (0.6); -2.2 ≤ 0.6 ≤ 2.7; IQR 0.7 (CV 0.9); 1061 distinct values
15. hs_pfos_c_Log2: 1 (1.1); -10.4 ≤ 1 ≤ 5.1; IQR 1.3 (CV 1.1); 1050 distinct values
16. hs_prpa_cadj_Log2: -1.6 (3.8); -12 ≤ -2.3 ≤ 10.8; IQR 5.2 (CV -2.4); 1031 distinct values
17. hs_mbzp_cadj_Log2: 2.4 (1.2); -0.6 ≤ 2.3 ≤ 7.2; IQR 1.5 (CV 0.5); 1046 distinct values
18. hs_mibp_cadj_Log2: 5.5 (1.1); 2.3 ≤ 5.4 ≤ 9.8; IQR 1.5 (CV 0.2); 1057 distinct values
19. hs_mnbp_cadj_Log2: 4.7 (1); 1.9 ≤ 4.6 ≤ 8.9; IQR 1.3 (CV 0.2); 1048 distinct values


# keep the numeric chemical variables (all selected chemical exposures are numeric)
numeric_chemical <- chemical_exposome %>% 
  dplyr::select(where(is.numeric))

numeric_chemical_long <- pivot_longer(
  numeric_chemical,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_chemical_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_chemical_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Cadmium (hs_cd_c_Log2): The histogram for cadmium exposure shows a relatively symmetric distribution centered near its mean (-3.97 on the log2 scale), with few outliers.

Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution with a slight positive skew, centered near its mean (-2.34 on the log2 scale).

Cesium (hs_cs_c_Log2): Exhibits a right-skewed distribution, indicating that most participants have relatively low exposure levels (mean 0.44 on the log2 scale), while a small number have substantially higher exposures.

Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.

Mercury (hs_hg_c_Log2): This distribution is also right-skewed, common for environmental pollutants, where a majority have lower exposure levels, and a minority have high exposure levels.

Molybdenum (hs_mo_c_Log2): Shows a sharp peak near its mean (-0.32 on the log2 scale) with a long right tail, suggesting that most people have similar exposure levels within a narrow range while a few have exceptionally high exposures.

Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.

DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher levels of exposure in a smaller subset of the population.

PCB 153 (hs_pcb153_cadj_Log2): The distribution is right-skewed and bimodal, with peaks around 2 and 4, suggesting that exposure to this compound is higher among a smaller segment of the population.

PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a significant right skew: most samples have low concentrations, while a few individuals show considerably higher levels.

DEP (hs_dep_cadj_Log2): DEP exposure is centered near its median (0.9 on the log2 scale) but spans a wide overall range (-12.6 to 9.4), indicating substantial variability.

PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, with one mode among the low values and another near the upper end of the range.

PFHxS: Distribution peaks near its median (-1.4 on the log2 scale) with a broad spread, indicating variation in PFHxS levels.

PFOA: Shows a roughly symmetric distribution centered near its median (0.6 on the log2 scale), indicating relatively consistent PFOA levels.

PFOS: Similar to PFOA, the distribution is centered near its median (1.0 on the log2 scale), indicating relatively consistent PFOS levels.

Propyl Paraben (hs_prpa_cadj_Log2): The distribution is centered near its median (-2.3 on the log2 scale) with a broad spread, indicating variability in propyl paraben levels.

Monobenzyl Phthalate (hs_mbzp_cadj_Log2): This histogram is right-skewed: most values cluster at the lower end (median 2.3 on the log2 scale), indicating a common lower exposure level among subjects, with a long tail toward higher values suggesting occasional higher exposures.

Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.

Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks near its median (4.6 on the log2 scale), indicating fairly consistent exposure levels with some variation; a few outliers are present.

Data Summary Covariates

Covariates were selected based on their relevance to the child's postnatal development. The only exceptions are sex and year of birth, which are determined at pregnancy and birth. These were included because BMI patterns differ between the sexes and vary with the child's age and year of birth.

# Specified covariates
specific_covariates <- c(
  "e3_sex_None", 
  "e3_yearbir_None",
  "hs_child_age_None"
)

covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Missing
1 e3_sex_None [factor] female: 608 (46.7%); male: 693 (53.3%) — 0 (0.0%) missing
2 e3_yearbir_None [factor] 2003: 55 (4.2%); 2004: 107 (8.2%); 2005: 241 (18.5%); 2006: 256 (19.7%); 2007: 250 (19.2%); 2008: 379 (29.1%); 2009: 13 (1.0%) — 0 (0.0%) missing
3 hs_child_age_None [numeric] Mean (sd): 8 (1.6); min ≤ med ≤ max: 5.4 ≤ 8 ≤ 12.1; IQR (CV): 2.4 (0.2); 879 distinct values; 0 (0.0%) missing


#separate numeric and categorical data
numeric_covariates <- covariate_data %>% 
  dplyr::select(where(is.numeric))

numeric_covariates_long <- pivot_longer(
  numeric_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_covariates_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.

categorical_covariates <- covariate_data %>% 
  dplyr::select(where(is.factor))

categorical_covariates_long <- pivot_longer(
  categorical_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_covariates_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count for males (693) than for females (608).

Year of Birth (e3_yearbir): This chart shows that the majority of subjects were born in the later years of recruitment, peaking in 2008 (29.1%); only 13 children (1.0%) were born in 2009.

Data Summary Outcome: Phenotype

The primary outcome of interest in this analysis is the Body Mass Index (BMI) Z-score of children, a measure of relative weight adjusted for child age and sex. The BMI Z-score provides a standardized way to compare a child's BMI with a reference population.
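As a conceptual illustration, a Z-score standardizes an observation against reference values. The sketch below uses made-up reference numbers (the function and its arguments are hypothetical, not the WHO procedure); note that the actual WHO growth reference uses an LMS (Box-Cox) transformation rather than a simple (value − median)/SD calculation.

```r
# Illustrative only: ref_median and ref_sd are hypothetical numbers,
# not the WHO growth-reference values underlying hs_zbmi_who.
bmi_zscore <- function(bmi, ref_median, ref_sd) {
  (bmi - ref_median) / ref_sd
}

# A child's BMI of 18.2 against a hypothetical age/sex reference
bmi_zscore(18.2, ref_median = 16.0, ref_sd = 1.8)
```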

outcome_BMI <- phenotype %>% 
  dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Missing
1 hs_zbmi_who [numeric] Mean (sd): 0.4 (1.2); min ≤ med ≤ max: -3.6 ≤ 0.3 ≤ 4.7; IQR (CV): 1.5 (3); 421 distinct values; 0 (0.0%) missing


Descriptive Tables

By Gender

combined_data <- cbind(covariate_data, dietary_exposome, chemical_exposome, outcome_BMI)

combined_data <- combined_data[, !duplicated(colnames(combined_data))]

# sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
# relabel in the order of the existing levels, c("female", "male")
levels(combined_data$e3_sex_None) <- c("Female", "Male")

render_cont <- function(x) {
  with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}

render_cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}

table1_formula <- ~ 
  hs_child_age_None + e3_yearbir_None + 
  hs_zbmi_who +
  h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
  hs_proc_meat_Ter +
  hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
  hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
  hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
  hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
  hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
  hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None

table1(
  table1_formula,
  data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",
  topclass = "Rtable1-shade"
)
Female (N=608)  Male (N=693)  Overall (N=1301)
hs_child_age_None 7.91 (1.58) 8.03 (1.64) 7.98 (1.61)
e3_yearbir_None
2003 25 (4.1 %) 30 (4.3 %) 55 (4.2 %)
2004 46 (7.6 %) 61 (8.8 %) 107 (8.2 %)
2005 121 (19.9 %) 120 (17.3 %) 241 (18.5 %)
2006 108 (17.8 %) 148 (21.4 %) 256 (19.7 %)
2007 128 (21.1 %) 122 (17.6 %) 250 (19.2 %)
2008 177 (29.1 %) 202 (29.1 %) 379 (29.1 %)
2009 3 (0.5 %) 10 (1.4 %) 13 (1.0 %)
hs_zbmi_who 0.35 (1.15) 0.45 (1.22) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 231 (38.0 %) 275 (39.7 %) 506 (38.9 %)
(10.8,34.9] 118 (19.4 %) 152 (21.9 %) 270 (20.8 %)
(34.9,Inf] 259 (42.6 %) 266 (38.4 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 164 (27.0 %) 181 (26.1 %) 345 (26.5 %)
(2,6] 188 (30.9 %) 235 (33.9 %) 423 (32.5 %)
(6,Inf] 256 (42.1 %) 277 (40.0 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 141 (23.2 %) 150 (21.6 %) 291 (22.4 %)
(1.1,5.5] 251 (41.3 %) 270 (39.0 %) 521 (40.0 %)
(5.5,Inf] 216 (35.5 %) 273 (39.4 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 175 (28.8 %) 184 (26.6 %) 359 (27.6 %)
(14.6,25.6] 229 (37.7 %) 236 (34.1 %) 465 (35.7 %)
(25.6,Inf] 204 (33.6 %) 273 (39.4 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 75 (12.3 %) 68 (9.8 %) 143 (11.0 %)
(0.132,0.5] 273 (44.9 %) 330 (47.6 %) 603 (46.3 %)
(0.5,Inf] 260 (42.8 %) 295 (42.6 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 211 (34.7 %) 218 (31.5 %) 429 (33.0 %)
(0.132,1] 191 (31.4 %) 205 (29.6 %) 396 (30.4 %)
(1,Inf] 206 (33.9 %) 270 (39.0 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 175 (28.8 %) 191 (27.6 %) 366 (28.1 %)
(1.5,4] 227 (37.3 %) 244 (35.2 %) 471 (36.2 %)
(4,Inf] 206 (33.9 %) 258 (37.2 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 183 (30.1 %) 206 (29.7 %) 389 (29.9 %)
(1.5,3] 224 (36.8 %) 230 (33.2 %) 454 (34.9 %)
(3,Inf] 201 (33.1 %) 257 (37.1 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 174 (28.6 %) 239 (34.5 %) 413 (31.7 %)
(7,14.1] 216 (35.5 %) 191 (27.6 %) 407 (31.3 %)
(14.1,Inf] 218 (35.9 %) 263 (38.0 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 193 (31.7 %) 204 (29.4 %) 397 (30.5 %)
(3,7] 171 (28.1 %) 232 (33.5 %) 403 (31.0 %)
(7,Inf] 244 (40.1 %) 257 (37.1 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 149 (24.5 %) 195 (28.1 %) 344 (26.4 %)
(4.1,8.5] 251 (41.3 %) 265 (38.2 %) 516 (39.7 %)
(8.5,Inf] 208 (34.2 %) 233 (33.6 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 190 (31.2 %) 214 (30.9 %) 404 (31.1 %)
(6,8.5] 136 (22.4 %) 178 (25.7 %) 314 (24.1 %)
(8.5,Inf] 282 (46.4 %) 301 (43.4 %) 583 (44.8 %)
hs_cd_c_Log2 -3.99 (0.98) -3.95 (1.09) -3.97 (1.04)
hs_co_c_Log2 -2.37 (0.61) -2.32 (0.64) -2.34 (0.63)
hs_cs_c_Log2 0.44 (0.58) 0.44 (0.57) 0.44 (0.57)
hs_cu_c_Log2 9.81 (0.25) 9.84 (0.22) 9.83 (0.23)
hs_hg_c_Log2 -0.24 (1.59) -0.35 (1.75) -0.30 (1.68)
hs_mo_c_Log2 -0.32 (0.83) -0.31 (0.96) -0.32 (0.90)
hs_dde_cadj_Log2 4.63 (1.48) 4.70 (1.50) 4.67 (1.49)
hs_pcb153_cadj_Log2 3.47 (0.86) 3.63 (0.94) 3.56 (0.90)
hs_pcb170_cadj_Log2 -0.60 (3.22) -0.05 (2.77) -0.31 (3.00)
hs_dep_cadj_Log2 0.27 (3.16) 0.06 (3.25) 0.16 (3.21)
hs_pbde153_cadj_Log2 -4.66 (3.86) -4.40 (3.80) -4.53 (3.83)
hs_pfhxs_c_Log2 -1.62 (1.30) -1.53 (1.31) -1.57 (1.31)
hs_pfoa_c_Log2 0.60 (0.55) 0.62 (0.56) 0.61 (0.55)
hs_pfos_c_Log2 0.95 (1.15) 0.99 (1.08) 0.97 (1.11)
hs_prpa_cadj_Log2 -1.26 (3.96) -1.91 (3.68) -1.61 (3.82)
hs_mbzp_cadj_Log2 2.42 (1.23) 2.47 (1.22) 2.44 (1.22)
hs_mibp_cadj_Log2 5.54 (1.09) 5.39 (1.12) 5.46 (1.11)
hs_mnbp_cadj_Log2 4.77 (1.08) 4.60 (0.96) 4.68 (1.02)

By Year of Birth

combined_data$e3_yearbir_None <- as.factor(combined_data$e3_yearbir_None)

render_cont <- function(x) {
  with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}

render_cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}

table1_formula_year <- ~ 
  hs_child_age_None + e3_sex_None + 
  hs_zbmi_who +
  h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
  hs_proc_meat_Ter +
  hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
  hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
  hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
  hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
  hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
  hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_yearbir_None

table1(
  table1_formula_year,
  data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",
  topclass = "Rtable1-shade"
)
2003 (N=55)  2004 (N=107)  2005 (N=241)  2006 (N=256)  2007 (N=250)  2008 (N=379)  2009 (N=13)  Overall (N=1301)
hs_child_age_None 11.32 (0.47) 10.81 (0.38) 9.19 (0.57) 8.38 (0.41) 6.91 (0.51) 6.41 (0.31) 6.18 (0.15) 7.98 (1.61)
e3_sex_None
Female 25 (45.5 %) 46 (43.0 %) 121 (50.2 %) 108 (42.2 %) 128 (51.2 %) 177 (46.7 %) 3 (23.1 %) 608 (46.7 %)
Male 30 (54.5 %) 61 (57.0 %) 120 (49.8 %) 148 (57.8 %) 122 (48.8 %) 202 (53.3 %) 10 (76.9 %) 693 (53.3 %)
hs_zbmi_who 0.32 (1.15) 0.12 (1.17) 0.45 (1.11) 0.34 (1.11) 0.41 (1.22) 0.49 (1.28) 0.79 (1.14) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 35 (63.6 %) 65 (60.7 %) 92 (38.2 %) 81 (31.6 %) 90 (36.0 %) 138 (36.4 %) 5 (38.5 %) 506 (38.9 %)
(10.8,34.9] 15 (27.3 %) 28 (26.2 %) 69 (28.6 %) 42 (16.4 %) 38 (15.2 %) 75 (19.8 %) 3 (23.1 %) 270 (20.8 %)
(34.9,Inf] 5 (9.1 %) 14 (13.1 %) 80 (33.2 %) 133 (52.0 %) 122 (48.8 %) 166 (43.8 %) 5 (38.5 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 16 (29.1 %) 21 (19.6 %) 88 (36.5 %) 117 (45.7 %) 48 (19.2 %) 53 (14.0 %) 2 (15.4 %) 345 (26.5 %)
(2,6] 12 (21.8 %) 30 (28.0 %) 75 (31.1 %) 90 (35.2 %) 77 (30.8 %) 132 (34.8 %) 7 (53.8 %) 423 (32.5 %)
(6,Inf] 27 (49.1 %) 56 (52.3 %) 78 (32.4 %) 49 (19.1 %) 125 (50.0 %) 194 (51.2 %) 4 (30.8 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 16 (29.1 %) 38 (35.5 %) 60 (24.9 %) 62 (24.2 %) 39 (15.6 %) 70 (18.5 %) 6 (46.2 %) 291 (22.4 %)
(1.1,5.5] 19 (34.5 %) 36 (33.6 %) 103 (42.7 %) 103 (40.2 %) 103 (41.2 %) 153 (40.4 %) 4 (30.8 %) 521 (40.0 %)
(5.5,Inf] 20 (36.4 %) 33 (30.8 %) 78 (32.4 %) 91 (35.5 %) 108 (43.2 %) 156 (41.2 %) 3 (23.1 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 15 (27.3 %) 21 (19.6 %) 67 (27.8 %) 58 (22.7 %) 73 (29.2 %) 119 (31.4 %) 6 (46.2 %) 359 (27.6 %)
(14.6,25.6] 12 (21.8 %) 27 (25.2 %) 90 (37.3 %) 101 (39.5 %) 80 (32.0 %) 149 (39.3 %) 6 (46.2 %) 465 (35.7 %)
(25.6,Inf] 28 (50.9 %) 59 (55.1 %) 84 (34.9 %) 97 (37.9 %) 97 (38.8 %) 111 (29.3 %) 1 (7.7 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 6 (10.9 %) 14 (13.1 %) 20 (8.3 %) 21 (8.2 %) 22 (8.8 %) 57 (15.0 %) 3 (23.1 %) 143 (11.0 %)
(0.132,0.5] 38 (69.1 %) 45 (42.1 %) 140 (58.1 %) 153 (59.8 %) 94 (37.6 %) 129 (34.0 %) 4 (30.8 %) 603 (46.3 %)
(0.5,Inf] 11 (20.0 %) 48 (44.9 %) 81 (33.6 %) 82 (32.0 %) 134 (53.6 %) 193 (50.9 %) 6 (46.2 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 17 (30.9 %) 30 (28.0 %) 77 (32.0 %) 51 (19.9 %) 96 (38.4 %) 155 (40.9 %) 3 (23.1 %) 429 (33.0 %)
(0.132,1] 20 (36.4 %) 39 (36.4 %) 78 (32.4 %) 99 (38.7 %) 68 (27.2 %) 87 (23.0 %) 5 (38.5 %) 396 (30.4 %)
(1,Inf] 18 (32.7 %) 38 (35.5 %) 86 (35.7 %) 106 (41.4 %) 86 (34.4 %) 137 (36.1 %) 5 (38.5 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 6 (10.9 %) 26 (24.3 %) 38 (15.8 %) 37 (14.5 %) 91 (36.4 %) 162 (42.7 %) 6 (46.2 %) 366 (28.1 %)
(1.5,4] 36 (65.5 %) 41 (38.3 %) 80 (33.2 %) 92 (35.9 %) 83 (33.2 %) 134 (35.4 %) 5 (38.5 %) 471 (36.2 %)
(4,Inf] 13 (23.6 %) 40 (37.4 %) 123 (51.0 %) 127 (49.6 %) 76 (30.4 %) 83 (21.9 %) 2 (15.4 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 11 (20.0 %) 20 (18.7 %) 32 (13.3 %) 36 (14.1 %) 106 (42.4 %) 180 (47.5 %) 4 (30.8 %) 389 (29.9 %)
(1.5,3] 31 (56.4 %) 56 (52.3 %) 80 (33.2 %) 70 (27.3 %) 82 (32.8 %) 128 (33.8 %) 7 (53.8 %) 454 (34.9 %)
(3,Inf] 13 (23.6 %) 31 (29.0 %) 129 (53.5 %) 150 (58.6 %) 62 (24.8 %) 71 (18.7 %) 2 (15.4 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 33 (60.0 %) 58 (54.2 %) 73 (30.3 %) 55 (21.5 %) 68 (27.2 %) 119 (31.4 %) 7 (53.8 %) 413 (31.7 %)
(7,14.1] 13 (23.6 %) 24 (22.4 %) 88 (36.5 %) 76 (29.7 %) 81 (32.4 %) 123 (32.5 %) 2 (15.4 %) 407 (31.3 %)
(14.1,Inf] 9 (16.4 %) 25 (23.4 %) 80 (33.2 %) 125 (48.8 %) 101 (40.4 %) 137 (36.1 %) 4 (30.8 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 9 (16.4 %) 18 (16.8 %) 97 (40.2 %) 82 (32.0 %) 67 (26.8 %) 122 (32.2 %) 2 (15.4 %) 397 (30.5 %)
(3,7] 26 (47.3 %) 45 (42.1 %) 71 (29.5 %) 59 (23.0 %) 80 (32.0 %) 116 (30.6 %) 6 (46.2 %) 403 (31.0 %)
(7,Inf] 20 (36.4 %) 44 (41.1 %) 73 (30.3 %) 115 (44.9 %) 103 (41.2 %) 141 (37.2 %) 5 (38.5 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 16 (29.1 %) 17 (15.9 %) 87 (36.1 %) 96 (37.5 %) 51 (20.4 %) 74 (19.5 %) 3 (23.1 %) 344 (26.4 %)
(4.1,8.5] 13 (23.6 %) 38 (35.5 %) 99 (41.1 %) 103 (40.2 %) 104 (41.6 %) 157 (41.4 %) 2 (15.4 %) 516 (39.7 %)
(8.5,Inf] 26 (47.3 %) 52 (48.6 %) 55 (22.8 %) 57 (22.3 %) 95 (38.0 %) 148 (39.1 %) 8 (61.5 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 19 (34.5 %) 26 (24.3 %) 75 (31.1 %) 63 (24.6 %) 81 (32.4 %) 136 (35.9 %) 4 (30.8 %) 404 (31.1 %)
(6,8.5] 11 (20.0 %) 25 (23.4 %) 53 (22.0 %) 71 (27.7 %) 55 (22.0 %) 95 (25.1 %) 4 (30.8 %) 314 (24.1 %)
(8.5,Inf] 25 (45.5 %) 56 (52.3 %) 113 (46.9 %) 122 (47.7 %) 114 (45.6 %) 148 (39.1 %) 5 (38.5 %) 583 (44.8 %)
hs_cd_c_Log2 -4.32 (1.45) -3.93 (1.01) -3.90 (1.15) -3.91 (1.00) -3.90 (0.88) -4.05 (0.97) -4.09 (1.85) -3.97 (1.04)
hs_co_c_Log2 -2.35 (0.51) -2.38 (0.59) -2.54 (0.60) -2.45 (0.68) -2.27 (0.58) -2.20 (0.62) -2.11 (0.55) -2.34 (0.63)
hs_cs_c_Log2 1.04 (0.47) 1.03 (0.50) 0.71 (0.43) 0.65 (0.43) 0.18 (0.50) 0.07 (0.45) -0.04 (0.40) 0.44 (0.57)
hs_cu_c_Log2 9.83 (0.18) 9.89 (0.30) 9.79 (0.21) 9.76 (0.22) 9.82 (0.22) 9.88 (0.23) 9.86 (0.26) 9.83 (0.23)
hs_hg_c_Log2 0.56 (1.21) 0.73 (1.27) 0.41 (1.35) 0.12 (1.35) -0.78 (1.73) -1.11 (1.70) -0.75 (1.41) -0.30 (1.68)
hs_mo_c_Log2 -0.85 (1.80) -0.43 (0.64) -0.45 (0.83) -0.32 (0.81) -0.26 (0.83) -0.16 (0.88) -0.27 (0.94) -0.32 (0.90)
hs_dde_cadj_Log2 4.06 (1.34) 3.90 (1.09) 4.23 (1.24) 4.32 (1.02) 4.96 (1.66) 5.28 (1.62) 5.16 (1.23) 4.67 (1.49)
hs_pcb153_cadj_Log2 3.54 (0.73) 3.47 (0.72) 3.80 (0.84) 4.03 (0.79) 3.32 (0.95) 3.25 (0.87) 3.69 (0.96) 3.56 (0.90)
hs_pcb170_cadj_Log2 0.48 (1.19) 0.17 (2.28) 0.64 (2.35) 1.08 (1.77) -1.46 (3.58) -1.32 (3.29) -0.92 (3.65) -0.31 (3.00)
hs_dep_cadj_Log2 -0.62 (3.08) 0.10 (3.10) 0.13 (3.31) 0.19 (2.90) 0.65 (3.10) -0.02 (3.42) -0.25 (3.57) 0.16 (3.21)
hs_pbde153_cadj_Log2 -4.43 (3.35) -5.26 (3.63) -4.26 (3.77) -3.68 (3.60) -4.30 (3.95) -5.20 (3.93) -5.08 (3.90) -4.53 (3.83)
hs_pfhxs_c_Log2 -0.58 (0.66) -0.53 (0.85) -1.01 (0.99) -1.07 (0.88) -2.03 (1.40) -2.37 (1.22) -2.68 (0.71) -1.57 (1.31)
hs_pfoa_c_Log2 0.53 (0.44) 0.55 (0.53) 0.67 (0.49) 0.64 (0.51) 0.66 (0.63) 0.55 (0.58) 0.35 (0.50) 0.61 (0.55)
hs_pfos_c_Log2 1.67 (0.68) 1.55 (0.72) 1.18 (1.00) 1.10 (1.14) 0.80 (1.05) 0.62 (1.17) 0.40 (0.94) 0.97 (1.11)
hs_prpa_cadj_Log2 -3.21 (3.68) -2.67 (3.04) -0.92 (3.92) -1.65 (3.89) -1.57 (3.80) -1.58 (3.80) 0.87 (4.50) -1.61 (3.82)
hs_mbzp_cadj_Log2 2.69 (1.08) 2.84 (1.11) 2.43 (1.19) 2.32 (1.14) 2.36 (1.34) 2.45 (1.24) 2.21 (1.51) 2.44 (1.22)
hs_mibp_cadj_Log2 5.19 (1.08) 5.59 (1.07) 4.84 (0.97) 4.82 (0.98) 5.84 (1.00) 6.02 (0.93) 6.27 (0.97) 5.46 (1.11)
hs_mnbp_cadj_Log2 3.99 (0.86) 4.34 (0.88) 4.26 (0.90) 4.52 (0.90) 4.85 (0.95) 5.11 (1.05) 5.22 (0.77) 4.68 (1.02)

Variable Selection

For variable selection, elastic net was applied to the available diet and chemical variables in the HELIX data; the variables it retained were carried forward for further analysis.

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
  "hs_as_c_Log2",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mn_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_tl_cdich_None",
  "hs_dde_cadj_Log2",
  "hs_ddt_cadj_Log2",
  "hs_hcb_cadj_Log2",
  "hs_pcb118_cadj_Log2",
  "hs_pcb138_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_pcb180_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_dmdtp_cdich_None",
  "hs_dmp_cadj_Log2",
  "hs_dmtp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pbde47_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfna_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_pfunda_c_Log2",
  "hs_bpa_cadj_Log2",
  "hs_bupa_cadj_Log2",
  "hs_etpa_cadj_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_trcs_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mecpp_cadj_Log2",
  "hs_mehhp_cadj_Log2",
  "hs_mehp_cadj_Log2",
  "hs_meohp_cadj_Log2",
  "hs_mep_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2",
  "hs_ohminp_cadj_Log2",
  "hs_oxominp_cadj_Log2",
  "hs_cotinine_cdich_None",
  "hs_globalexp2_None"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_beverages_Ter",
  "hs_break_cer_Ter",
  "hs_caff_drink_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_cereal_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_meat_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_total_yog_Ter"
)

chemicals_columns <- c(chemicals_full)
all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_columns))

diet_columns <- c(postnatal_diet)
all_diet <- exposome %>% dplyr::select(all_of(diet_columns))

all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)

diet_outcome_cov <- cbind(outcome_cov, all_diet)

interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)

Chemicals Data

The chemical exposures will be screened for the most predictive variables using elastic net.

# train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)

x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]

x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]

x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])
# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)
test_rmse_without_covariates <- sqrt(test_mse_without_covariates)
# plot the coefficient path from the underlying glmnet fit
plot(fit_without_covariates_train$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                  s1
## (Intercept)            -4.785950188
## hs_as_c_Log2            .          
## hs_cd_c_Log2           -0.025843356
## hs_co_c_Log2           -0.005835867
## hs_cs_c_Log2            0.084715330
## hs_cu_c_Log2            0.607379616
## hs_hg_c_Log2           -0.009800093
## hs_mn_c_Log2            .          
## hs_mo_c_Log2           -0.099724922
## hs_pb_c_Log2           -0.010318890
## hs_tl_cdich_None        .          
## hs_dde_cadj_Log2       -0.039528137
## hs_ddt_cadj_Log2        .          
## hs_hcb_cadj_Log2        .          
## hs_pcb118_cadj_Log2     .          
## hs_pcb138_cadj_Log2     .          
## hs_pcb153_cadj_Log2    -0.169008355
## hs_pcb170_cadj_Log2    -0.055808065
## hs_pcb180_cadj_Log2     .          
## hs_dep_cadj_Log2       -0.019034348
## hs_detp_cadj_Log2       .          
## hs_dmdtp_cdich_None     .          
## hs_dmp_cadj_Log2        .          
## hs_dmtp_cadj_Log2       .          
## hs_pbde153_cadj_Log2   -0.035464586
## hs_pbde47_cadj_Log2     .          
## hs_pfhxs_c_Log2        -0.006816020
## hs_pfna_c_Log2          .          
## hs_pfoa_c_Log2         -0.135997766
## hs_pfos_c_Log2         -0.047692264
## hs_pfunda_c_Log2        .          
## hs_bpa_cadj_Log2        .          
## hs_bupa_cadj_Log2       .          
## hs_etpa_cadj_Log2       .          
## hs_mepa_cadj_Log2       .          
## hs_oxbe_cadj_Log2       0.002529961
## hs_prpa_cadj_Log2       0.001735800
## hs_trcs_cadj_Log2       .          
## hs_mbzp_cadj_Log2       0.040317847
## hs_mecpp_cadj_Log2      .          
## hs_mehhp_cadj_Log2      .          
## hs_mehp_cadj_Log2       .          
## hs_meohp_cadj_Log2      .          
## hs_mep_cadj_Log2        .          
## hs_mibp_cadj_Log2      -0.047892677
## hs_mnbp_cadj_Log2      -0.008483913
## hs_ohminp_cadj_Log2     .          
## hs_oxominp_cadj_Log2    .          
## hs_cotinine_cdich_None  .          
## hs_globalexp2_None      .
cat("Model without Covariates - Test RMSE:", test_rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.108515
#selected chemicals that were noted in enet
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")

The chemical features were selected from the nonzero coefficients of the elastic net fit. Elastic net was preferred over LASSO because LASSO may oversimplify the dimensionality by arbitrarily dropping one of a set of correlated predictors; with feature importance uncertain, elastic net retains groups of correlated variables.

Postnatal Diet Data

As with the chemical variables, the children's postnatal diet variables will be screened for the most predictive variables using the same elastic net approach.

# train/test
set.seed(101)  
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)

diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])  
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])  

covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ]) 
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)

x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0

y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min 0.1384     9   1.431 0.06031       5
## 1se 0.2914     1   1.442 0.06168       0
# plot the coefficient path from the underlying glmnet fit
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                         s1
## (Intercept)                     0.52746936
## h_bfdur_Ter(0,10.8]             .         
## h_bfdur_Ter(10.8,34.9]          .         
## h_bfdur_Ter(34.9,Inf]           .         
## hs_bakery_prod_Ter(2,6]         .         
## hs_bakery_prod_Ter(6,Inf]       .         
## hs_beverages_Ter(0.132,1]       .         
## hs_beverages_Ter(1,Inf]         .         
## hs_break_cer_Ter(1.1,5.5]       .         
## hs_break_cer_Ter(5.5,Inf]       .         
## hs_caff_drink_Ter(0.132,Inf]    .         
## hs_dairy_Ter(14.6,25.6]         .         
## hs_dairy_Ter(25.6,Inf]          .         
## hs_fastfood_Ter(0.132,0.5]      .         
## hs_fastfood_Ter(0.5,Inf]        .         
## hs_org_food_Ter(0.132,1]        .         
## hs_org_food_Ter(1,Inf]         -0.12911952
## hs_proc_meat_Ter(1.5,4]         .         
## hs_proc_meat_Ter(4,Inf]         .         
## hs_readymade_Ter(0.132,0.5]     .         
## hs_readymade_Ter(0.5,Inf]       .         
## hs_total_bread_Ter(7,17.5]      .         
## hs_total_bread_Ter(17.5,Inf]    .         
## hs_total_cereal_Ter(14.1,23.6]  .         
## hs_total_cereal_Ter(23.6,Inf]   .         
## hs_total_fish_Ter(1.5,3]        .         
## hs_total_fish_Ter(3,Inf]        .         
## hs_total_fruits_Ter(7,14.1]     .         
## hs_total_fruits_Ter(14.1,Inf]  -0.02484832
## hs_total_lipids_Ter(3,7]        .         
## hs_total_lipids_Ter(7,Inf]     -0.05017966
## hs_total_meat_Ter(6,9]          .         
## hs_total_meat_Ter(9,Inf]        .         
## hs_total_potatoes_Ter(3,4]      .         
## hs_total_potatoes_Ter(4,Inf]    .         
## hs_total_sweets_Ter(4.1,8.5]   -0.01453272
## hs_total_sweets_Ter(8.5,Inf]    .         
## hs_total_veg_Ter(6,8.5]         .         
## hs_total_veg_Ter(8.5,Inf]      -0.07850581
## hs_total_yog_Ter(6,8.5]         .         
## hs_total_yog_Ter(8.5,Inf]       .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)
rmse_without_covariates <- sqrt(mse_without_covariates)
cat("Model without Covariates - Test RMSE:", rmse_without_covariates)
## Model without Covariates - Test RMSE: 1.161644
#selected diets that were noted in enet
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

Statistical Models

To analyze the impact of dietary, chemical, and demographic variables on BMI Z-scores, several statistical models were employed, each chosen for its strengths in handling different aspects of the data. All models were trained and evaluated using a 70/30 train-test split.

LASSO (Least Absolute Shrinkage and Selection Operator) applies L1 regularization, penalizing the sum of the absolute values of the coefficients. It performs variable selection and regularization simultaneously, identifying the most significant predictors while shrinking less important coefficients exactly to zero.

Ridge regression is particularly useful for handling multicollinearity among predictors, ensuring that all variables contribute to the prediction without being overly penalized.

Elastic net balances the benefits of both LASSO and ridge; by handling variable selection and multicollinearity simultaneously, it is well-suited for high-dimensional datasets with correlated predictors.

Because combining the data sources (diet, chemicals, and the metabolomic serum data) introduces blocks of correlated predictors, group LASSO, group ridge, and group elastic net were also applied, penalizing related variables jointly.
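
As an illustrative sketch (simulated data, not the study's code), a group lasso can be fit with the grpreg package: dummy columns derived from the same tertile factor share one group index, so the whole factor enters or leaves the model together.

```r
library(grpreg)

set.seed(101)
n <- 200
X <- matrix(rnorm(n * 6), n, 6)                    # e.g., two tertile factors -> 3 dummies each
group <- c(1, 1, 1, 2, 2, 2)                       # dummies from the same factor share a group
y <- as.numeric(X %*% c(0.5, 0.5, 0.5, 0, 0, 0)) + rnorm(n)

# Cross-validated group lasso; coef() defaults to the lambda minimizing CV error
cv_fit <- cv.grpreg(X, y, group = group, penalty = "grLasso")
coef(cv_fit)                                       # grouped coefficients are zeroed jointly
```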

Decision trees split the data into subsets based on the values of input features, providing an interpretable structure for understanding the relationships between variables and the outcome. Because unpruned trees are prone to overfitting, pruning was applied to improve generalizability.

Random forest constructs multiple decision trees and merges them to obtain a more accurate and stable prediction. By averaging the predictions from numerous trees, this model reduces overfitting and captures complex interactions among variables.

Gradient Boosting Machine (GBM) builds an ensemble of trees in a sequential manner, where each tree corrects the errors of its predecessors. This approach is highly effective in improving predictive accuracy by focusing on the residuals of previous trees, making it powerful for capturing non-linear relationships.

These ensemble methods (random forest and GBM) were chosen for their robustness and high predictive accuracy. Random forest reduces variance by averaging multiple trees, while GBM improves performance through iterative refinement.

In this study, the Root Mean Squared Error (RMSE) is used as the primary performance metric to evaluate and compare the predictive models. Using RMSE allows for straightforward comparisons between different models and datasets. Since RMSE is in the same units as the outcome variable, it facilitates direct assessment of how well different models perform in predicting BMI Z-scores, making it easier to identify the model that provides the most accurate predictions.
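
Since RMSE is computed repeatedly throughout the analysis, it can be wrapped in a small helper; this sketch is illustrative of the calculation used below.

```r
# RMSE in the units of the outcome: square root of the mean squared prediction error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), approx. 1.155
```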

By employing a diverse set of models, this analysis aims to identify the most significant predictors of BMI Z-scores and understand the complex interactions between dietary intake, chemical exposures, and metabolomic serum data. The combination of regularization techniques and ensemble methods ensures a comprehensive and reliable assessment of the data.

Results

#selected chemicals that were noted in enet
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")
#selected diets that were noted in enet
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)
combined_data_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]

finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))

finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)

find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
  cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA      # keep each pair once (upper triangle only)
  cor_matrix <- as.data.frame(as.table(cor_matrix))         # long format: Var1, Var2, Freq (correlation)
  cor_matrix <- na.omit(cor_matrix)
  cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]  # sort by absolute correlation
  cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold) # retain pairs above the threshold
  return(cor_matrix)
}

highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs

The correlation plot for the selected variables indicates notable multicollinearity among several chemical variables and the child age covariate. Grouped regression models (group LASSO, ridge, and elastic net) handle these highly correlated variables collectively, which stabilizes coefficient estimates and reduces the risk of overfitting.

Baseline (Covariates)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))

x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who
train_control <- trainControl(method = "cv", number = 10)

set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0  # covariates remain unpenalized

LASSO

set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## e3_sex_Nonefemale    .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.152017
lasso_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)

cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.152017

As λ increases, the penalty for the coefficients becomes stronger, driving some coefficients to zero. The optimal λ (lambda.min) is chosen as the one that minimizes the cross-validated MSE. The coefficients listed show the non-zero variables selected by LASSO. In this case, children’s age is selected as significant.
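
As an illustrative aside (simulated data, not the HELIX variables), cv.glmnet also stores lambda.1se, a more conservative choice than lambda.min that trades a small increase in cross-validated error for a sparser model:

```r
library(glmnet)

set.seed(101)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- 0.5 * X[, 1] + rnorm(100)

cvfit <- cv.glmnet(X, y, alpha = 1)
c(min = cvfit$lambda.min, one_se = cvfit$lambda.1se)  # lambda.1se >= lambda.min
coef(cvfit, s = "lambda.1se")                         # typically fewer non-zero coefficients
```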

Ridge

set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          8.732635e-01
## hs_child_age_None   -5.738585e-02
## e3_sex_Nonefemale   -3.110211e-38
## e3_sex_Nonemale      3.110211e-38
## e3_yearbir_None2004 -1.463782e-37
## e3_yearbir_None2005  2.704907e-38
## e3_yearbir_None2006 -9.332751e-39
## e3_yearbir_None2007 -5.445405e-38
## e3_yearbir_None2008  3.129795e-38
## e3_yearbir_None2009  3.620840e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.152017
ridge_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)

cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.149128

The coefficients show that age has the most significant effect, similar to LASSO, with other variables having very small coefficients. The λ was optimized using the same cross-validation approach as LASSO. Ridge regression performs similarly to LASSO in terms of RMSE.

Elastic Net

set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## e3_sex_Nonefemale    .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.152017
enet_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)

cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.152017

Elastic Net uses cross-validation to find the best balance between LASSO and Ridge penalties. The Elastic Net RMSE is the same as LASSO, indicating similar predictive performance.

Decision Tree

set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_child_age_None
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.011581      0   1.00000 1.0026 0.051032
## 2 0.010000      1   0.98842 1.0130 0.051854
plotcp(fit_tree)

# getting the optimal cp value that minimizes the cross-validation error
optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]
cat("Optimal cp value:", optimal_cp, "\n")
## Optimal cp value: 0.01158106
pruned_tree <- prune(fit_tree, cp = optimal_cp)
rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Baseline Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Baseline Pruned Decision Tree RMSE: 1.148665
set.seed(101)

train_control <- trainControl(method = "cv", number = 10)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))

pruned_tree_model <- train(
  hs_zbmi_who ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.007
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))

rpart.plot(fit_tree_unpruned)
rpart.plot(fit_tree_best)

unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)

mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)

mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.155427
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.155427

The decision tree plot shows a simple model with age as the primary split. The complexity parameter (cp) controls tree pruning, and the optimal cp was selected by cross-validation to minimize the cross-validation error. The test RMSE shows that pruning left performance essentially unchanged, with the tree performing comparably to the regularization methods.

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)

rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.157731
rf_model <- train(
  x_train, y_train,
  method = "rf",
  trControl = train_control,
  tuneLength = 10
)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)

cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.154789

Age is the most important variable, followed by year of birth. Random Forest performs similarly to other models in terms of RMSE.

GBM

fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.14888
gbm_model <- train(
  x_train, y_train,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10,
  verbose = FALSE
)

gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)

cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.149695

Age has the highest relative influence. GBM shows slightly better performance compared to some other models.

Model Comparison:

LASSO, Ridge, and Elastic Net: Provide similar predictive performance, with all models effectively handling multicollinearity.

Decision Tree: Simple and interpretable, but less effective than regularization methods.

Random Forest and GBM: Offer competitive performance, with GBM slightly better in RMSE.

Diet + Covariates

diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")

diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]

train_control <- trainControl(method = "cv", number = 10)

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## e3_sex_Nonefemale              .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.138756
lasso_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)

cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.139468

The coefficient paths plot for LASSO regression (shown above) displays the changes in the coefficients of the predictors as the regularization parameter λ varies. The vertical dashed lines indicate the optimal λ values chosen by cross-validation to minimize the Mean Squared Error (MSE). As λ increases, more coefficients are shrunk to zero, resulting in a simpler model.

The optimal λ is selected based on the 10-fold cross-validation procedure, which evaluates the model performance on different subsets of the data. The λ value corresponding to the lowest cross-validated error is chosen to ensure the best predictive performance while preventing overfitting.

The LASSO model with diet variables added shows an RMSE of approximately 1.14, indicating the average error between the predicted and actual BMI Z-scores. The model identified child’s age as a significant predictor with a non-zero coefficient, while most diet-related variables were shrunk to zero, indicating their lesser impact when combined with covariates.

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    8.732635e-01
## hs_child_age_None             -5.738585e-02
## e3_sex_Nonefemale             -3.392957e-38
## e3_sex_Nonemale                3.392957e-38
## e3_yearbir_None2004           -1.596853e-37
## e3_yearbir_None2005            2.950807e-38
## e3_yearbir_None2006           -1.018118e-38
## e3_yearbir_None2007           -5.940442e-38
## e3_yearbir_None2008            3.414322e-38
## e3_yearbir_None2009            3.950007e-37
## h_bfdur_Ter(10.8,34.9]         1.959443e-37
## h_bfdur_Ter(34.9,Inf]         -1.196150e-37
## hs_bakery_prod_Ter(2,6]        2.102760e-37
## hs_bakery_prod_Ter(6,Inf]     -1.896677e-37
## hs_break_cer_Ter(1.1,5.5]      9.801144e-39
## hs_break_cer_Ter(5.5,Inf]     -2.286856e-37
## hs_dairy_Ter(14.6,25.6]        5.634165e-38
## hs_dairy_Ter(25.6,Inf]         8.216707e-41
## hs_fastfood_Ter(0.132,0.5]     1.215262e-37
## hs_fastfood_Ter(0.5,Inf]      -6.235230e-38
## hs_org_food_Ter(0.132,1]       8.146218e-38
## hs_org_food_Ter(1,Inf]        -1.900266e-37
## hs_proc_meat_Ter(1.5,4]        1.193171e-37
## hs_proc_meat_Ter(4,Inf]       -1.994408e-38
## hs_total_fish_Ter(1.5,3]      -8.902720e-38
## hs_total_fish_Ter(3,Inf]       2.141752e-38
## hs_total_fruits_Ter(7,14.1]    2.168036e-37
## hs_total_fruits_Ter(14.1,Inf] -2.040600e-37
## hs_total_lipids_Ter(3,7]      -9.170321e-39
## hs_total_lipids_Ter(7,Inf]    -2.153479e-37
## hs_total_sweets_Ter(4.1,8.5]  -7.677763e-38
## hs_total_sweets_Ter(8.5,Inf]  -1.716694e-38
## hs_total_veg_Ter(6,8.5]        2.132182e-38
## hs_total_veg_Ter(8.5,Inf]     -1.282604e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.130793
ridge_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)

cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.129036

The coefficient paths plot for Ridge regression shows the changes in the coefficients as λ varies. Unlike LASSO, Ridge regression shrinks the coefficients but does not set them to zero. This results in all predictors contributing to the model, though some may have very small coefficients.
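
This contrast can be seen directly on simulated data (an illustrative sketch, not the study's variables): LASSO produces exact zeros at lambda.min, while ridge coefficients are shrunk but remain non-zero.

```r
library(glmnet)

set.seed(101)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] - X[, 2] + rnorm(100)

lasso <- cv.glmnet(X, y, alpha = 1)  # L1: sparse solution
ridge <- cv.glmnet(X, y, alpha = 0)  # L2: dense solution

sum(coef(lasso, s = "lambda.min") == 0)  # several exact zeros
sum(coef(ridge, s = "lambda.min") == 0)  # typically none
```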

The Ridge model with the addition of diet variables shows an RMSE of around 1.13, suggesting slightly better performance compared to the LASSO model.

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## e3_sex_Nonefemale              .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.136809
enet_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)

cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.138767

The coefficient paths for Elastic Net combine the features of both LASSO and Ridge regressions, balancing between setting some coefficients to zero (like LASSO) and shrinking others (like Ridge). The paths display the progression of coefficients as λ varies.

The Elastic Net model for Diet + Covariates also performs well, with an RMSE of about 1.14, falling between the LASSO and ridge results.

Decision Tree

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Diet + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Covariates Decision Tree RMSE: 1.152094
printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_bakery_prod_Ter  hs_break_cer_Ter    hs_child_age_None  
## [4] hs_total_lipids_Ter
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.011773      0   1.00000 1.0010 0.050839
## 2 0.010484      3   0.96468 1.0155 0.052218
## 3 0.010000      4   0.95420 1.0156 0.052011
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Diet + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Covariates Decision Tree RMSE: 1.148665
set.seed(101)

train_control <- trainControl(method = "cv", number = 10)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))

pruned_tree_model <- train(
  hs_zbmi_who ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.1
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))

rpart.plot(fit_tree_unpruned)

rpart.plot(fit_tree_best)

unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)

mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)

mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.152094
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.148665

The decision tree split on child’s age first, indicating it is the most important variable. Subsequent splits involve dietary variables like bakery products, cereal, and total lipids.

The tree was pruned using the complexity parameter (cp) to avoid overfitting. The optimal cp was found using cross-validation.

Pruning slightly improved the model’s performance by reducing overfitting.

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                     IncNodePurity
## hs_child_age_None       233.57368
## e3_sex_None              34.19375
## e3_yearbir_None          97.75201
## h_bfdur_Ter              65.45676
## hs_bakery_prod_Ter       64.03655
## hs_break_cer_Ter         65.99478
## hs_dairy_Ter             69.72797
## hs_fastfood_Ter          55.45883
## hs_org_food_Ter          65.89629
## hs_proc_meat_Ter         68.85280
## hs_total_fish_Ter        63.72114
## hs_total_fruits_Ter      64.85756
## hs_total_lipids_Ter      65.79002
## hs_total_sweets_Ter      66.74543
## hs_total_veg_Ter         69.89130
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.139899
rf_model <- train(
  x_train, y_train,
  method = "rf",
  trControl = train_control,
  tuneLength = 10
)

rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)

cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.129478

Random forest identified child’s age and year of birth as the most important variables. The model provided a robust prediction by aggregating multiple decision trees, reducing variance, and handling nonlinear relationships well.

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.137923
gbm_model <- train(
  x_train, y_train,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10,
  verbose = FALSE
)

gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)

cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.136542

GBM also highlighted age as the most influential variable, followed by birth year and several diet variables. The number of trees was optimized using cross-validation, indicated by the vertical line in the error plot. GBM showed strong predictive performance by combining boosting with decision trees, focusing on reducing bias and variance.

Model Comparison:

The inclusion of diet variables generally improved the model performance, highlighting the importance of considering dietary factors alongside traditional covariates in predicting BMI Z-score. All models consistently identified child’s age as a significant predictor. Regularization techniques like ridge and elastic net performed well, handling multicollinearity and ensuring robust predictions. The decision tree and random forest models highlighted the importance of specific dietary variables in combination with covariates. GBM provided a comprehensive approach by effectively combining multiple weak learners, optimizing performance.

Chemicals + Covariates

chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]

train_control <- trainControl(method = "cv", number = 10)

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
# leave the selected covariates unpenalized; note that only columns whose names
# exactly match (the continuous covariates) are caught here, since factor
# columns are dummy-expanded by model.matrix
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          -1.201642945
## hs_child_age_None    -0.042729967
## e3_sex_Nonefemale     .          
## e3_sex_Nonemale       .          
## e3_yearbir_None2004  -0.040205226
## e3_yearbir_None2005   .          
## e3_yearbir_None2006   .          
## e3_yearbir_None2007   .          
## e3_yearbir_None2008   .          
## e3_yearbir_None2009   .          
## hs_cd_c_Log2         -0.011319810
## hs_co_c_Log2          .          
## hs_cs_c_Log2          .          
## hs_cu_c_Log2          0.260654501
## hs_hg_c_Log2          .          
## hs_mo_c_Log2         -0.069199903
## hs_pb_c_Log2          .          
## hs_dde_cadj_Log2     -0.041339092
## hs_pcb153_cadj_Log2  -0.125818759
## hs_pcb170_cadj_Log2  -0.043143964
## hs_dep_cadj_Log2     -0.003677983
## hs_pbde153_cadj_Log2 -0.036720000
## hs_pfhxs_c_Log2       .          
## hs_pfoa_c_Log2       -0.097979410
## hs_pfos_c_Log2        .          
## hs_prpa_cadj_Log2     .          
## hs_mbzp_cadj_Log2     0.007077421
## hs_mibp_cadj_Log2    -0.025837249
## hs_mnbp_cadj_Log2    -0.004654193
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.040557
lasso_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

lasso_predictions_cv <- predict(lasso_model, x_test)
mse_lasso_cv <- mean((y_test - lasso_predictions_cv)^2)
rmse_lasso_cv <- sqrt(mse_lasso_cv)

cat("10-Fold CV Lasso RMSE:", rmse_lasso_cv)
## 10-Fold CV Lasso RMSE: 1.040727

The optimal lambda (λ) is chosen by cross-validation, which minimizes the Mean Squared Error (MSE); here the minimum occurs near log(λ) ≈ -2.5. Child age, copper (hs_cu_c_Log2), molybdenum (hs_mo_c_Log2), and organochlorines such as PCB-153 (hs_pcb153_cadj_Log2) and DDE (hs_dde_cadj_Log2) retain nonzero coefficients, while cobalt (hs_co_c_Log2) and mercury (hs_hg_c_Log2) are shrunk to zero.
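The lambda selection performed by cv.glmnet can be illustrated on simulated data (a sketch, not the HELIX data): lambda.min minimizes cross-validated error, while lambda.1se gives the sparsest model within one standard error of that minimum.

```r
library(glmnet)

set.seed(101)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1)

# lambda.1se always applies at least as much shrinkage as lambda.min
c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se)

# "." entries in the coefficient printout are exactly zero under LASSO
coef(cv_fit, s = "lambda.min")
```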

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          -1.609471669
## hs_child_age_None    -0.054184972
## e3_sex_Nonefemale    -0.023174163
## e3_sex_Nonemale       0.023148148
## e3_yearbir_None2004  -0.128309522
## e3_yearbir_None2005   0.028409105
## e3_yearbir_None2006   0.053001523
## e3_yearbir_None2007  -0.027948121
## e3_yearbir_None2008  -0.015135402
## e3_yearbir_None2009   0.235538660
## hs_cd_c_Log2         -0.036089222
## hs_co_c_Log2         -0.030180059
## hs_cs_c_Log2          0.066973465
## hs_cu_c_Log2          0.319809539
## hs_hg_c_Log2         -0.014578657
## hs_mo_c_Log2         -0.071569755
## hs_pb_c_Log2         -0.039355400
## hs_dde_cadj_Log2     -0.048885680
## hs_pcb153_cadj_Log2  -0.115503946
## hs_pcb170_cadj_Log2  -0.036730018
## hs_dep_cadj_Log2     -0.011840831
## hs_pbde153_cadj_Log2 -0.027003858
## hs_pfhxs_c_Log2      -0.026465601
## hs_pfoa_c_Log2       -0.101071688
## hs_pfos_c_Log2       -0.025692420
## hs_prpa_cadj_Log2     0.001538225
## hs_mbzp_cadj_Log2     0.039172212
## hs_mibp_cadj_Log2    -0.035550506
## hs_mnbp_cadj_Log2    -0.039493899
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.035505
ridge_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

ridge_predictions_cv <- predict(ridge_model, x_test)
mse_ridge_cv <- mean((y_test - ridge_predictions_cv)^2)
rmse_ridge_cv <- sqrt(mse_ridge_cv)

cat("10-Fold CV Ridge RMSE:", rmse_ridge_cv, "\n")
## 10-Fold CV Ridge RMSE: 1.035411

The optimal lambda is found around log(λ) ≈ 0, where MSE is minimized. Because ridge shrinks coefficients without zeroing them, all variables remain in the model; age, birth year, and copper (hs_cu_c_Log2) are among the largest in magnitude.
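The contrast with LASSO can be seen directly on simulated data (a sketch): ridge (alpha = 0) shrinks every coefficient but zeroes none of them.

```r
library(glmnet)

set.seed(101)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- x[, 1] + rnorm(100)

fit_ridge <- cv.glmnet(x, y, alpha = 0)
b <- as.numeric(coef(fit_ridge, s = "lambda.min"))

# under ridge, no coefficient is exactly zero
sum(b == 0)
```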

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          -1.30722494
## hs_child_age_None    -0.04475073
## e3_sex_Nonefemale     .         
## e3_sex_Nonemale       .         
## e3_yearbir_None2004  -0.06251662
## e3_yearbir_None2005   .         
## e3_yearbir_None2006   .         
## e3_yearbir_None2007   .         
## e3_yearbir_None2008   .         
## e3_yearbir_None2009   .         
## hs_cd_c_Log2         -0.01538343
## hs_co_c_Log2          .         
## hs_cs_c_Log2          .         
## hs_cu_c_Log2          0.27721103
## hs_hg_c_Log2          .         
## hs_mo_c_Log2         -0.07149465
## hs_pb_c_Log2          .         
## hs_dde_cadj_Log2     -0.04498749
## hs_pcb153_cadj_Log2  -0.12563741
## hs_pcb170_cadj_Log2  -0.04301914
## hs_dep_cadj_Log2     -0.00523836
## hs_pbde153_cadj_Log2 -0.03600515
## hs_pfhxs_c_Log2       .         
## hs_pfoa_c_Log2       -0.10302534
## hs_pfos_c_Log2        .         
## hs_prpa_cadj_Log2     .         
## hs_mbzp_cadj_Log2     0.01485361
## hs_mibp_cadj_Log2    -0.02916026
## hs_mnbp_cadj_Log2    -0.01195464
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.040324
enet_model <- train(
  x_train, y_train,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), lambda = seq(0.001, 1, length = 100)),
  trControl = train_control,
  penalty.factor = penalty_factors
)

enet_predictions_cv <- predict(enet_model, x_test)
mse_enet_cv <- mean((y_test - enet_predictions_cv)^2)
rmse_enet_cv <- sqrt(mse_enet_cv)

cat("10-Fold CV Elastic Net RMSE:", rmse_enet_cv, "\n")
## 10-Fold CV Elastic Net RMSE: 1.035456

Elastic Net combines LASSO and Ridge penalties, leading to variable selection and coefficient shrinkage. The optimal lambda is chosen near log(λ) ≈ -2.5.

Variables:

  • hs_child_age_None: -0.0448

  • hs_cu_c_Log2: 0.2772 (positive association)

  • hs_pfoa_c_Log2: -0.1030

  • hs_cd_c_Log2: -0.0154 (slight negative association)

  • hs_pcb153_cadj_Log2: -0.1256

Elastic Net confirms the strong positive association of copper (hs_cu_c_Log2) while also identifying negative associations for hs_pfoa_c_Log2 and the organochlorines hs_pcb153_cadj_Log2 and hs_dde_cadj_Log2.

Decision Tree

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Chemical + Covariates Decision Tree RMSE: 1.108581
printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2         hs_dde_cadj_Log2     hs_mbzp_cadj_Log2   
## [4] hs_mnbp_cadj_Log2    hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2 
## [7] hs_pfoa_c_Log2      
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.077388      0   1.00000 1.00089 0.050861
## 2  0.032769      1   0.92261 0.93652 0.048377
## 3  0.032579      2   0.88984 0.91682 0.046750
## 4  0.025008      3   0.85726 0.91218 0.046570
## 5  0.014062      4   0.83226 0.89795 0.045241
## 6  0.013674      6   0.80413 0.89837 0.046138
## 7  0.013602      7   0.79046 0.89907 0.046094
## 8  0.013517      8   0.77686 0.89907 0.046094
## 9  0.011314      9   0.76334 0.90679 0.046317
## 10 0.011047     11   0.74071 0.91227 0.047148
## 11 0.010329     12   0.72967 0.92524 0.047400
## 12 0.010000     14   0.70901 0.92466 0.047181
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)

train_control <- trainControl(method = "cv", number = 10)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))

pruned_tree_model <- train(
  hs_zbmi_who ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)

rpart.plot(fit_tree_best)

unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)

mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)

mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.108581
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746

The decision tree diagram shows the splits based on different chemical covariates. The root node splits on Polychlorinated biphenyl-170 (hs_pcb170_cadj_Log2), indicating its high importance.

Variables such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 are prominent, as they appear at the top levels of the tree.

The unpruned decision tree has an RMSE of 1.109; after pruning at the optimal complexity parameter (cp), the RMSE drops to 1.090, indicating improved generalization.
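The pruning workflow used here (grow the tree, inspect the cptable, prune at the xerror-minimizing cp) can be sketched on simulated data:

```r
library(rpart)

set.seed(101)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- ifelse(d$x1 > 0, 1, -1) + rnorm(300, sd = 0.5)

fit <- rpart(y ~ ., data = d, method = "anova")

# choose the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# pruning can only remove splits, never add them
nrow(pruned$cptable) <= nrow(fit$cptable)
```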

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        48.822971
## e3_sex_None               6.356073
## e3_yearbir_None          34.327078
## hs_cd_c_Log2             49.069793
## hs_co_c_Log2             51.682060
## hs_cs_c_Log2             42.322759
## hs_cu_c_Log2             66.200851
## hs_hg_c_Log2             53.722138
## hs_mo_c_Log2             59.137163
## hs_pb_c_Log2             46.948933
## hs_dde_cadj_Log2         81.399724
## hs_pcb153_cadj_Log2      86.076191
## hs_pcb170_cadj_Log2     132.902458
## hs_dep_cadj_Log2         55.402055
## hs_pbde153_cadj_Log2    100.134593
## hs_pfhxs_c_Log2          44.500787
## hs_pfoa_c_Log2           60.306554
## hs_pfos_c_Log2           51.809387
## hs_prpa_cadj_Log2        50.765210
## hs_mbzp_cadj_Log2        57.448789
## hs_mibp_cadj_Log2        45.657870
## hs_mnbp_cadj_Log2        53.036576
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.031425
rf_model <- train(
  x_train, y_train,
  method = "rf",
  trControl = train_control,
  tuneLength = 10
)

rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)

cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.031669

The random forest model’s variable importance plot ranks predictors by IncNodePurity, the total decrease in node impurity (residual sum of squares) across all splits on each variable. hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 are the top contributors.

The random forest model achieves an RMSE of 1.031, demonstrating robust predictive performance from the ensemble of decision trees.
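By default randomForest reports only IncNodePurity; a sketch on simulated data shows how setting importance = TRUE additionally yields the permutation-based %IncMSE, which is generally less biased toward high-cardinality predictors:

```r
library(randomForest)

set.seed(101)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), noise = rnorm(200))
d$y <- 2 * d$x1 + rnorm(200, sd = 0.5)

fit <- randomForest(y ~ ., data = d, ntree = 300, importance = TRUE)

# with importance = TRUE, both %IncMSE and IncNodePurity are reported
imp <- importance(fit)
imp
```

The informative x1 should dominate the pure-noise column on both measures.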

GBM

fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.076264
gbm_model <- train(
  x_train, y_train,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10,
  verbose = FALSE
)

gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)

cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.040439

The GBM model’s relative influence plot shows that hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 have the highest relative influence on the model’s predictions.

The GBM model has an RMSE of 1.076, and the cross-validated RMSE is 1.040, indicating good predictive accuracy and generalization to unseen data.

Model Comparison:

LASSO, Ridge, and Elastic Net Models: These models effectively identify influential chemical covariates, with the best λ selected by cross-validation. Chemical variables such as hs_cu_c_Log2, hs_pcb153_cadj_Log2, and hs_pfoa_c_Log2 show consistent effects across all three penalties. Ridge regression and elastic net yield slightly lower RMSE values, suggesting marginally better generalization than LASSO.

Decision Tree: Provides an intuitive model with key splits on important chemical covariates but may overfit without pruning.

Random Forest: Offers robust performance by aggregating multiple trees, highlighting key variables and achieving low RMSE.

GBM: Combines boosting with tree-based methods to enhance predictive accuracy, with clear identification of influential variables.

Diet + Chemical + Covariates

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
  "h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
  "hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
  "hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)

chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))

set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]

train_control <- trainControl(method = "cv", number = 10)

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)

# build the group index vector: one block of indices per variable group
group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_diet),           # Group 2: Diet
  rep(3, num_chemicals)       # Group 3: Chemicals
)

# model.matrix dummy-expands factors, so x_train can have more columns than
# there are variables; pad the remaining columns into a catch-all fourth group
if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # should be TRUE
## [1] TRUE
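Padding trailing columns into a catch-all group is fragile, because model.matrix interleaves each factor's dummy columns with the other variables. A more robust sketch uses the "assign" attribute of model.matrix, which records the term that produced each column (the variable names below are illustrative, not the HELIX columns):

```r
# Sketch: build a group vector aligned with dummy-expanded columns.
d <- data.frame(
  age  = rnorm(10),
  sex  = factor(rep(c("f", "m"), 5)),
  diet = factor(rep(c("low", "mid", "high"), length.out = 10)),
  y    = rnorm(10)
)

X <- model.matrix(y ~ . - 1, d)
term_of_column <- attr(X, "assign")  # term index for every column of X

# map each term to its group: age and sex are covariates (1), diet is diet (2)
term_group <- c(1, 1, 2)
groups <- term_group[term_of_column]

length(groups) == ncol(X)  # TRUE by construction
```

Each factor's dummy columns then inherit the group of their parent variable automatically.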

Group LASSO

set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.042529
set.seed(101)
cv_group_lasso <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.02299869
coef_lasso <- coef(cv_group_lasso, s = "lambda.min")

sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]

sig_vars_lasso_df <- data.frame(
  Variable = names(sig_vars_lasso),
  Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_test, s = "lambda.min", type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 1.031241

The cross-validated RMSE is 1.031, showing consistency in performance across different data subsets.

Significant Variables:

  • e3_yearbir_None2009: The positive coefficient (0.349) suggests that being born in 2009 is associated with an increase in the BMI Z-score.

  • hs_cu_c_Log2: The positive coefficient (0.478) suggests that higher levels of copper are strongly associated with an increase in the BMI Z-score.

  • hs_cd_c_Log2: The negative coefficient (-0.047) indicates that higher levels of Cadmium are associated with a decrease in BMI Z-score.

  • hs_mibp_cadj_Log2: The negative coefficient (-0.077) indicates that higher levels of MiBP are associated with a decrease in BMI Z-score.

  • hs_total_fruits_Ter(7,14.1]: The positive coefficient (0.059) suggests that consuming fruits in this range is associated with an increase in the outcome variable.

These variables have the most significant influence on the outcome based on the coefficients from the group LASSO model.
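Group LASSO's defining behavior, with whole groups entering or leaving the model together, can be sketched with grpreg on simulated data:

```r
library(grpreg)

set.seed(101)
X <- matrix(rnorm(200 * 6), 200, 6)
y <- X[, 1] + X[, 2] + rnorm(200)

group <- c(1, 1, 2, 2, 3, 3)  # three groups of two columns each
cvfit <- cv.grpreg(X, y, group = group, penalty = "grLasso")

b <- coef(cvfit)[-1]  # coefficients at the CV-selected lambda, intercept dropped
# per group: are any of its coefficients retained?
tapply(b != 0, group, any)
```

The signal group (columns 1 and 2) is retained at the cross-validated lambda, while noise groups are candidates for joint removal.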

Group Ridge (grpreg grMCP penalty)

set.seed(101)
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")

group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")

mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)

cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 1.042024
set.seed(101)

cv_group_ridge_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_ridge_model$lambda.min, "\n")
## Optimal lambda: 0.04019086
coef_ridge <- coef(cv_group_ridge_model, s = "lambda.min")

sig_vars_ridge <- coef_ridge[coef_ridge != 0]
sig_vars_ridge <- sig_vars_ridge[names(sig_vars_ridge) != "(Intercept)"]

sig_vars_ridge_df <- data.frame(
  Variable = names(sig_vars_ridge),
  Coefficient = as.numeric(sig_vars_ridge)
)
sig_vars_ridge_df
group_ridge_predictions <- predict(cv_group_ridge_model, x_test, s = "lambda.min", type = "response")

mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE with 10-fold CV:", rmse_group_ridge, "\n")
## Group Ridge RMSE with 10-fold CV: 1.03338

e3_yearbir_None2009: This indicator marks children born in 2009. Its positive coefficient (0.465) suggests that being born in 2009 is associated with an increase in the response variable relative to the reference birth year.

hs_cu_c_Log2: This variable represents the log-transformed concentration of Copper (Cu) in serum. A positive coefficient (0.606) indicates that higher concentrations of copper are associated with an increase in the response variable.

hs_mbzp_cadj_Log2: This variable represents the log-transformed, adjusted concentration of mono-benzyl phthalate (MBzP), a phthalate metabolite. A positive coefficient (0.110) suggests that higher levels of MBzP are associated with an increase in the response variable.

hs_break_cer_Ter(5.5,Inf]: This variable represents a high intake of breakfast cereals (tertile > 5.5 servings per week). A negative coefficient (-0.115) indicates that higher consumption of breakfast cereals is associated with a decrease in the response variable.

hs_pcb153_cadj_Log2: This variable represents the log-transformed serum concentration of PCB-153, a polychlorinated biphenyl. A negative coefficient (-0.199) suggests that higher levels of PCB-153 are associated with a decrease in the response variable.

Group Elastic Net (grpreg grSCAD penalty)

set.seed(101)
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")

group_enet_predictions <- predict(group_enet_model, x_test, type = "response")

mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)

cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 1.043394
set.seed(101)

cv_group_enet_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_enet_model$lambda.min, "\n")
## Optimal lambda: 0.03662041
coef_enet <- coef(cv_group_enet_model, s = "lambda.min")

sig_vars_enet <- coef_enet[coef_enet != 0]
sig_vars_enet <- sig_vars_enet[names(sig_vars_enet) != "(Intercept)"]

sig_vars_enet_df <- data.frame(
  Variable = names(sig_vars_enet),
  Coefficient = as.numeric(sig_vars_enet)
)
sig_vars_enet_df
group_enet_predictions <- predict(cv_group_enet_model, x_test, s = "lambda.min", type = "response")

mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE with 10-fold CV:", rmse_group_enet, "\n")
## Group Elastic Net RMSE with 10-fold CV: 1.032168
  • e3_yearbir_None2009: The year of birth 2009 is positively associated (0.2673) with the outcome variable. This suggests that being born in 2009 is linked with an increase in the dependent variable.

  • hs_cu_c_Log2: Higher log-transformed copper levels are strongly positively associated (0.544) with the outcome, indicating that higher copper concentrations are linked with an increase in the dependent variable.

  • h_bfdur_Ter(10.8,34.9]: This breastfeeding duration category is positively associated (0.087) with the outcome, suggesting that breastfeeding for 10.8 to 34.9 months is linked with an increase in the dependent variable.

  • hs_pb_c_Log2: Higher log-transformed lead levels are negatively associated (-0.096) with the outcome, suggesting that higher lead concentrations are linked with a decrease in the dependent variable.

  • e3_yearbir_None2004: The year of birth 2004 is negatively associated (-0.125) with the outcome, indicating that being born in 2004 is linked with a decrease in the dependent variable.

Decision Tree

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Diet + Chemical + Covariates Decision Tree RMSE:", rmse_tree, "\n")
## Diet + Chemical + Covariates Decision Tree RMSE: 1.128749
printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2         hs_dde_cadj_Log2     hs_fastfood_Ter     
## [4] hs_mnbp_cadj_Log2    hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2 
## [7] hs_pfoa_c_Log2      
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.077388      0   1.00000 1.00089 0.050861
## 2  0.032769      1   0.92261 0.93652 0.048377
## 3  0.032579      2   0.88984 0.91682 0.046750
## 4  0.025008      3   0.85726 0.91218 0.046570
## 5  0.015671      4   0.83226 0.90443 0.045712
## 6  0.013674      5   0.81658 0.91485 0.047377
## 7  0.013602      6   0.80291 0.91478 0.047609
## 8  0.013517      7   0.78931 0.91478 0.047609
## 9  0.011314      8   0.77579 0.93397 0.049086
## 10 0.011047     10   0.75317 0.96270 0.050245
## 11 0.010296     11   0.74212 0.98215 0.050293
## 12 0.010000     12   0.73182 0.98607 0.050369
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Diet + Chemical + Covariates Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Diet + Chemical + Covariates Decision Tree RMSE: 1.089746
set.seed(101)

train_control <- trainControl(method = "cv", number = 10)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))

pruned_tree_model <- train(
  hs_zbmi_who ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.023
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)

rpart.plot(fit_tree_best)

unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)

mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)

mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.128749
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.089746

The RMSE for the unpruned decision tree is 1.129, and after pruning, the RMSE is 1.090. Pruning improves the model’s performance by reducing overfitting.

Important Variables:

  • hs_pcb170_cadj_Log2

  • hs_pbde153_cadj_Log2

  • hs_dde_cadj_Log2

  • hs_mnbp_cadj_Log2

  • hs_pfoa_c_Log2

These variables are crucial for splitting the data effectively in the tree.

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
par(mar = c(6, 14, 4, 4) + 0.1) 
varImpPlot(fit_rf, cex = 0.6)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        41.267546
## e3_sex_None               5.540866
## e3_yearbir_None          31.564485
## h_bfdur_Ter              13.136212
## hs_bakery_prod_Ter       20.689578
## hs_break_cer_Ter         13.679182
## hs_dairy_Ter             11.181729
## hs_fastfood_Ter          11.895794
## hs_org_food_Ter          10.856862
## hs_proc_meat_Ter         10.827847
## hs_total_fish_Ter        10.984827
## hs_total_fruits_Ter      13.977047
## hs_total_lipids_Ter      12.980031
## hs_total_sweets_Ter      10.749403
## hs_total_veg_Ter         12.258026
## hs_cd_c_Log2             44.886048
## hs_co_c_Log2             45.054544
## hs_cs_c_Log2             34.878871
## hs_cu_c_Log2             60.387604
## hs_hg_c_Log2             43.835480
## hs_mo_c_Log2             50.597285
## hs_pb_c_Log2             39.126962
## hs_dde_cadj_Log2         71.115165
## hs_pcb153_cadj_Log2      74.294671
## hs_pcb170_cadj_Log2     130.998985
## hs_dep_cadj_Log2         50.333164
## hs_pbde153_cadj_Log2     92.865989
## hs_pfhxs_c_Log2          38.593526
## hs_pfoa_c_Log2           53.392093
## hs_pfos_c_Log2           44.360304
## hs_prpa_cadj_Log2        43.848750
## hs_mbzp_cadj_Log2        49.661827
## hs_mibp_cadj_Log2        38.399318
## hs_mnbp_cadj_Log2        45.955350
rf_predictions <- predict(fit_rf, newdata = test_data)

mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.024848
rf_model <- train(
  x_train, y_train,
  method = "rf",
  trControl = train_control,
  tuneLength = 10
)

rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)

cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.025544

The RMSE for the Random Forest model is 1.025, and the cross-validated RMSE is 1.026, indicating strong performance.

Important Variables:

  • hs_pcb170_cadj_Log2

  • hs_pbde153_cadj_Log2

  • hs_dde_cadj_Log2

  • hs_cu_c_Log2

  • hs_pfoa_c_Log2

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.072432
gbm_model <- train(
  x_train, y_train,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10,
  verbose = FALSE
)

gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)

cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 1.048183

The RMSE for the GBM model is 1.072, and the cross-validated RMSE is 1.048, showing good predictive accuracy and generalization to unseen data.

Important Variables (based on Relative Influence):

  • hs_pcb170_cadj_Log2

  • hs_pbde153_cadj_Log2

  • hs_dde_cadj_Log2

  • hs_pfoa_c_Log2

  • hs_cu_c_Log2

These variables have the highest relative influence in the model, contributing significantly to reducing prediction errors.
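The relative-influence ranking can also be extracted programmatically rather than read off the `summary()` bar chart. A minimal sketch, assuming `fit_gbm` and `best_iter` from the chunks above; `summary.gbm()` returns a data frame with `var` and `rel.inf` columns, and `plotit = FALSE` suppresses the plot:

```r
# Relative influence as a data frame, sorted descending by rel.inf
rel_inf <- summary(fit_gbm, n.trees = best_iter, plotit = FALSE)
head(rel_inf, 5)  # top five contributors
```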

Diet + Chemical + Metabolomic + Covariates

For the metabolomic data, missing values were excluded rather than imputed, since too many entries could not be reliably filled in with the mean or median.

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))

# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]

# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
| ID   | metab_1 | metab_2 | metab_3 | metab_4 | metab_5 | … | metab_177 |
|:----:|:-------:|:-------:|:-------:|:-------:|:-------:|:-:|:---------:|
| 430  | -2.15   | -0.71   | 8.60    | 0.55    | 7.05    | … | 8.36      |
| 1187 | -0.69   | -0.37   | 9.15    | -1.33   | 6.89    | … | 7.91      |
| 940  | -0.69   | -0.36   | 8.95    | -0.13   | 7.10    | … | 8.12      |
| 936  | -0.19   | -0.34   | 8.54    | -0.62   | 7.01    | … | 7.98      |
| 788  | -1.96   | -0.35   | 8.73    | -0.80   | 6.90    | … | 8.04      |
| 698  | -1.90   | -0.63   | 8.24    | -0.46   | 6.94    | … | 7.78      |

(first five of the 177 metabolite columns shown; the remaining columns are omitted for readability)
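The exclusion decision above can be made concrete by quantifying missingness per metabolite before ruling out imputation. A hedged sketch, assuming `metabol_serum_transposed` from the chunk above; the 20% cutoff is illustrative only, not a threshold used in this analysis:

```r
# Fraction of missing entries per metabolite (ID column dropped)
miss_frac <- colMeans(is.na(metabol_serum_transposed[, -1]))
summary(miss_frac)
# Metabolites whose missingness exceeds the illustrative 20% cutoff
names(miss_frac)[miss_frac > 0.20]
```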
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

#specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)

outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
  dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
#the full chemicals list
chemicals_specific <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")

all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

selected_id_data <- cbind(outcome_and_cov, extracted_exposome)

# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)

selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)

Group LASSO

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data  <- selected_metabolomics_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who

train_control <- trainControl(method = "cv", number = 10)

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(postnatal_diet)          # vector defined above
num_chemicals <- length(chemicals_specific) # vector defined above
num_metabolomics <- ncol(metabol_serum_transposed) - 1  # subtract for the ID column

group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_diet),           # Group 2: Diet
  rep(3, num_chemicals),      # Group 3: Chemicals
  rep(4, num_metabolomics)    # Group 4: Metabolomics
)

if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # this should be TRUE
## [1] TRUE
# fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9334466
set.seed(101)
cv_group_lasso <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_lasso$lambda.min, "\n")
## Optimal lambda: 0.01400055
coef_lasso <- coef(cv_group_lasso, lambda = cv_group_lasso$lambda.min)  # grpreg takes lambda =, not glmnet-style s =

sig_vars_lasso <- coef_lasso[coef_lasso != 0]
sig_vars_lasso <- sig_vars_lasso[names(sig_vars_lasso) != "(Intercept)"]

sig_vars_lasso_df <- data.frame(
  Variable = names(sig_vars_lasso),
  Coefficient = as.numeric(sig_vars_lasso)
)
sig_vars_lasso_df
group_lasso_predictions <- predict(cv_group_lasso, x_test, lambda = cv_group_lasso$lambda.min, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)
cat("Group LASSO RMSE with 10-fold CV:", rmse_group_lasso, "\n")
## Group LASSO RMSE with 10-fold CV: 0.8748226

Group Lasso performs well, achieving the lowest test RMSE and CV RMSE among the penalized regression models. Because the group lasso penalty can zero out entire groups of coefficients, it yields a comparatively parsimonious model that retains only the most informative predictors, balancing variable selection and prediction accuracy effectively.
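Which variables and groups survive at the CV-selected lambda can be checked directly. A sketch assuming `cv_group_lasso` from above; `predict.grpreg()` supports `type = "vars"` and `type = "groups"`, returning the indices of nonzero variables and of groups with at least one nonzero coefficient:

```r
# Indices of variables with nonzero coefficients at lambda.min
selected_vars <- predict(cv_group_lasso, type = "vars",
                         lambda = cv_group_lasso$lambda.min)
# Which of the four groups (covariates, diet, chemicals, metabolomics) remain
selected_groups <- predict(cv_group_lasso, type = "groups",
                           lambda = cv_group_lasso$lambda.min)
selected_groups
```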

Group Ridge

set.seed(101)
group_ridge_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")
group_ridge_predictions <- predict(group_ridge_model, x_test, type = "response")
mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE:", rmse_group_ridge, "\n")
## Group Ridge RMSE: 0.9424999
set.seed(101)

cv_group_ridge_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_ridge_model$lambda.min, "\n")
## Optimal lambda: 0.02446636
coef_ridge <- coef(cv_group_ridge_model, lambda = cv_group_ridge_model$lambda.min)  # grpreg takes lambda =, not glmnet-style s =

sig_vars_ridge <- coef_ridge[coef_ridge != 0]
sig_vars_ridge <- sig_vars_ridge[names(sig_vars_ridge) != "(Intercept)"]

sig_vars_ridge_df <- data.frame(
  Variable = names(sig_vars_ridge),
  Coefficient = as.numeric(sig_vars_ridge)
)
sig_vars_ridge_df
group_ridge_predictions <- predict(cv_group_ridge_model, x_test, lambda = cv_group_ridge_model$lambda.min, type = "response")

mse_group_ridge <- mean((y_test - group_ridge_predictions)^2)
rmse_group_ridge <- sqrt(mse_group_ridge)
cat("Group Ridge RMSE with 10-fold CV:", rmse_group_ridge, "\n")
## Group Ridge RMSE with 10-fold CV: 0.8849928

Group Ridge has slightly higher RMSE values than Group Lasso. Its larger optimal lambda indicates stronger penalization and heavier shrinkage of the coefficients. Note, however, that grpreg's "grMCP" penalty is the group minimax concave penalty (MCP), not a true ridge penalty: unlike ridge, which shrinks all coefficients but keeps every variable, group MCP can drop entire groups, so the "Group Ridge" label should be read with that caveat.

Group Elastic Net

set.seed(101)
group_enet_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")
group_enet_predictions <- predict(group_enet_model, x_test, type = "response")
mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE:", rmse_group_enet, "\n")
## Group Elastic Net RMSE: 0.9440745
set.seed(101)

cv_group_enet_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian", nfolds = 10)

cat("Optimal lambda:", cv_group_enet_model$lambda.min, "\n")
## Optimal lambda: 0.0185079
coef_enet <- coef(cv_group_enet_model, lambda = cv_group_enet_model$lambda.min)  # grpreg takes lambda =, not glmnet-style s =

sig_vars_enet <- coef_enet[coef_enet != 0]
sig_vars_enet <- sig_vars_enet[names(sig_vars_enet) != "(Intercept)"]

sig_vars_enet_df <- data.frame(
  Variable = names(sig_vars_enet),
  Coefficient = as.numeric(sig_vars_enet)
)
sig_vars_enet_df
group_enet_predictions <- predict(cv_group_enet_model, x_test, lambda = cv_group_enet_model$lambda.min, type = "response")

mse_group_enet <- mean((y_test - group_enet_predictions)^2)
rmse_group_enet <- sqrt(mse_group_enet)
cat("Group Elastic Net RMSE with 10-fold CV:", rmse_group_enet, "\n")
## Group Elastic Net RMSE with 10-fold CV: 0.8844928

This model, fit with grpreg's "grSCAD" penalty, performs similarly to the grMCP fit above. Note that grSCAD is the group smoothly clipped absolute deviation (SCAD) penalty rather than a true elastic net (a lasso-ridge mixture); like SCAD generally, it balances variable selection against coefficient shrinkage, but here with slightly higher RMSE values than Group Lasso.

Decision Trees

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, 
                                  list = FALSE, 
                                  times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data  <- selected_metabolomics_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ ., train_data)[,-1]
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ ., test_data)[,-1]
y_test <- test_data$hs_zbmi_who

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

tree_predictions <- predict(fit_tree, newdata = test_data)
mse_tree <- mean((test_data$hs_zbmi_who - tree_predictions)^2)
rmse_tree <- sqrt(mse_tree)

cat("Full Decision Tree RMSE:", rmse_tree, "\n")
## Full Decision Tree RMSE: 1.243108
printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
##  [1] hs_cu_c_Log2        hs_dde_cadj_Log2    hs_pcb170_cadj_Log2
##  [4] metab_100           metab_129           metab_140          
##  [7] metab_141           metab_142           metab_157          
## [10] metab_161           metab_176           metab_47           
## [13] metab_58            metab_8             metab_94           
## [16] metab_95           
## 
## Root node error: 1254/841 = 1.4911
## 
## n= 841 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.089116      0   1.00000 1.00275 0.052776
## 2  0.069078      1   0.91088 0.94590 0.050339
## 3  0.037249      2   0.84181 0.91981 0.049729
## 4  0.034324      3   0.80456 0.90111 0.048711
## 5  0.025576      4   0.77023 0.89055 0.049604
## 6  0.021936      5   0.74466 0.90166 0.048993
## 7  0.021276      6   0.72272 0.90895 0.049515
## 8  0.020244      8   0.68017 0.90807 0.050310
## 9  0.016938      9   0.65992 0.91277 0.050356
## 10 0.015553     10   0.64299 0.90306 0.050272
## 11 0.015178     11   0.62743 0.90122 0.050368
## 12 0.014756     12   0.61226 0.90066 0.050300
## 13 0.014022     13   0.59750 0.91203 0.051412
## 14 0.013859     14   0.58348 0.92227 0.051732
## 15 0.012010     15   0.56962 0.93488 0.053622
## 16 0.011088     16   0.55761 0.93327 0.053609
## 17 0.011027     17   0.54652 0.92988 0.053560
## 18 0.010000     18   0.53549 0.93467 0.053891
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.12821
set.seed(101)

train_control <- trainControl(method = "cv", number = 10)

cp_grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))

pruned_tree_model <- train(
  hs_zbmi_who ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

best_cp <- pruned_tree_model$bestTune$cp
cat("Best cp value from cross-validation:", best_cp, "\n")
## Best cp value from cross-validation: 0.036
fit_tree_unpruned <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

fit_tree_best <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova", control = rpart.control(cp = best_cp))
rpart.plot(fit_tree_unpruned)

rpart.plot(fit_tree_best)

unpruned_tree_predictions <- predict(fit_tree_unpruned, newdata = test_data)
pruned_tree_predictions <- predict(fit_tree_best, newdata = test_data)

mse_unpruned_tree <- mean((test_data$hs_zbmi_who - unpruned_tree_predictions)^2)
rmse_unpruned_tree <- sqrt(mse_unpruned_tree)

mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Unpruned Decision Tree RMSE:", rmse_unpruned_tree, "\n")
## Unpruned Decision Tree RMSE: 1.243108
cat("Pruned Decision Tree RMSE with Best cp:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE with Best cp: 1.127666

Pruning improves the decision tree’s performance, but it still has a higher RMSE compared to penalized regression models. Pruning helps to reduce overfitting but does not achieve the same level of predictive accuracy.

Random Forest

set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
##                       IncNodePurity
## hs_child_age_None         4.7927709
## e3_sex_None               0.5707810
## e3_yearbir_None           5.3687139
## hs_cd_c_Log2              4.8057550
## hs_co_c_Log2              5.5171622
## hs_cs_c_Log2              5.0356667
## hs_cu_c_Log2             12.4603237
## hs_hg_c_Log2              6.5609155
## hs_mo_c_Log2              9.6711331
## hs_pb_c_Log2              6.0711093
## hs_dde_cadj_Log2         13.2357957
## hs_pcb153_cadj_Log2      42.5218764
## hs_pcb170_cadj_Log2      78.6536355
## hs_dep_cadj_Log2          5.7451267
## hs_pbde153_cadj_Log2     28.0252092
## hs_pfhxs_c_Log2           5.4200119
## hs_pfoa_c_Log2            9.0245854
## hs_pfos_c_Log2            6.9947002
## hs_prpa_cadj_Log2         5.0401167
## hs_mbzp_cadj_Log2         4.4169868
## hs_mibp_cadj_Log2         4.0028339
## hs_mnbp_cadj_Log2         5.0996136
## h_bfdur_Ter               2.8968599
## hs_bakery_prod_Ter        2.9013737
## hs_dairy_Ter              1.2216138
## hs_fastfood_Ter           0.9580573
## hs_org_food_Ter           1.1570809
## hs_readymade_Ter          1.4572844
## hs_total_bread_Ter        1.1592172
## hs_total_fish_Ter         1.4961459
## hs_total_fruits_Ter       1.1767595
## hs_total_lipids_Ter       1.3061318
## hs_total_potatoes_Ter     1.2547008
## hs_total_sweets_Ter       0.9540810
## hs_total_veg_Ter          1.0142937
## metab_1                   4.5011841
## metab_2                   4.1531041
## metab_3                   3.4379265
## metab_4                   5.7831834
## metab_5                   2.9178829
## metab_6                   9.2084145
## metab_7                   4.5028527
## metab_8                  36.1613265
## metab_9                   2.8921966
## metab_10                  3.3253911
## metab_11                  4.0194203
## metab_12                  3.3756205
## metab_13                  4.1227190
## metab_14                  5.2837840
## metab_15                  4.4314031
## metab_16                  2.6462109
## metab_17                  2.4025892
## metab_18                  3.0563512
## metab_19                  2.2951339
## metab_20                  3.9067077
## metab_21                  2.8607845
## metab_22                  2.6869075
## metab_23                  2.8972758
## metab_24                  3.7219939
## metab_25                  3.4797174
## metab_26                  7.4516177
## metab_27                  3.3263961
## metab_28                  3.6858414
## metab_29                  3.3029111
## metab_30                 18.5312981
## metab_31                  3.4436727
## metab_32                  3.0545040
## metab_33                  5.2038360
## metab_34                  2.4118531
## metab_35                  7.7143364
## metab_36                  3.8489845
## metab_37                  3.5293186
## metab_38                  2.5404708
## metab_39                  2.5165610
## metab_40                  5.1546331
## metab_41                  3.8921235
## metab_42                  6.1284140
## metab_43                  3.4008043
## metab_44                  3.5698540
## metab_45                  3.6758543
## metab_46                  5.1802108
## metab_47                  6.2120208
## metab_48                 11.7806465
## metab_49                 33.0932505
## metab_50                  9.9343588
## metab_51                  6.2231737
## metab_52                  3.3042007
## metab_53                  5.1286048
## metab_54                  4.9482602
## metab_55                  7.9980726
## metab_56                  4.6904911
## metab_57                  4.7919533
## metab_58                  3.2769554
## metab_59                  5.7319035
## metab_60                  3.9192335
## metab_61                  3.0283668
## metab_62                  4.3267745
## metab_63                  4.8698592
## metab_64                  3.9307501
## metab_65                  3.8785658
## metab_66                  2.5032525
## metab_67                  2.7330496
## metab_68                  3.5029673
## metab_69                  2.4526201
## metab_70                  3.2796735
## metab_71                  4.6222410
## metab_72                  3.4044271
## metab_73                  3.0460977
## metab_74                  2.7018733
## metab_75                  3.8889307
## metab_76                  2.8113720
## metab_77                  4.7832198
## metab_78                  4.9096271
## metab_79                  4.0912870
## metab_80                  3.5334220
## metab_81                  3.3897748
## metab_82                  4.6950398
## metab_83                  3.3638699
## metab_84                  3.0885189
## metab_85                  4.6423208
## metab_86                  3.2104952
## metab_87                  3.8282082
## metab_88                  2.8852207
## metab_89                  2.7940235
## metab_90                  2.6775937
## metab_91                  3.2999703
## metab_92                  3.4644294
## metab_93                  2.8208312
## metab_94                  8.2011811
## metab_95                 51.6744901
## metab_96                  8.1118854
## metab_97                  3.0816526
## metab_98                  3.2655000
## metab_99                  4.4506345
## metab_100                 3.9391564
## metab_101                 3.6374127
## metab_102                 5.7070910
## metab_103                 4.1270357
## metab_104                 5.6319496
## metab_105                 3.3000728
## metab_106                 3.4540736
## metab_107                 3.8068032
## metab_108                 4.0689264
## metab_109                 5.3846332
## metab_110                 7.8695016
## metab_111                 2.7218197
## metab_112                 2.7613608
## metab_113                 5.5357192
## metab_114                 4.3740721
## metab_115                 4.3476546
## metab_116                 6.1735729
## metab_117                 7.0544259
## metab_118                 2.4601605
## metab_119                 5.4450427
## metab_120                 7.1242577
## metab_121                 3.7051871
## metab_122                 6.2282665
## metab_123                 3.2628782
## metab_124                 3.3382452
## metab_125                 3.4201616
## metab_126                 2.4057902
## metab_127                 5.0861475
## metab_128                 6.4178268
## metab_129                 4.2203870
## metab_130                 3.4512521
## metab_131                 3.2548319
## metab_132                 3.2654363
## metab_133                 2.6708217
## metab_134                 3.2328840
## metab_135                 5.2587159
## metab_136                 5.0406508
## metab_137                 6.3400908
## metab_138                 4.7455979
## metab_139                 3.7303384
## metab_140                 2.7325221
## metab_141                 7.9353350
## metab_142                13.5320442
## metab_143                 9.6865623
## metab_144                 3.9033383
## metab_145                 4.2785181
## metab_146                 4.2770667
## metab_147                 3.5426166
## metab_148                 3.3025573
## metab_149                 5.6260285
## metab_150                 4.9789354
## metab_151                 3.5939464
## metab_152                 4.7580724
## metab_153                 4.3991465
## metab_154                 5.0953881
## metab_155                 2.5069093
## metab_156                 2.8819060
## metab_157                 4.3767383
## metab_158                 3.9225685
## metab_159                 3.3643529
## metab_160                 7.0114089
## metab_161                28.1188870
## metab_162                 3.9630717
## metab_163                14.4399347
## metab_164                 7.1747718
## metab_165                 3.9310471
## metab_166                 3.8429719
## metab_167                 3.3589423
## metab_168                 3.1719266
## metab_169                 4.2720474
## metab_170                 4.2149550
## metab_171                 4.9419993
## metab_172                 3.7296383
## metab_173                 4.0956131
## metab_174                 4.6193564
## metab_175                 3.7661891
## metab_176                 6.2453467
## metab_177                14.2757880
par(mar = c(6, 14, 4, 4) + 0.1) 
varImpPlot(rf_model, cex = 0.6)

rf_predictions <- predict(rf_model, newdata = test_data)

rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)

cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.004679
rf_model <- train(
  x_train, y_train,
  method = "rf",
  trControl = train_control,
  tuneLength = 10
)

rf_predictions_cv <- predict(rf_model, x_test)
mse_rf_cv <- mean((y_test - rf_predictions_cv)^2)
rmse_rf_cv <- sqrt(mse_rf_cv)

cat("10-Fold CV Random Forest RMSE:", rmse_rf_cv, "\n")
## 10-Fold CV Random Forest RMSE: 1.004854

Random Forest performs better than the decision trees but worse than the penalized regression models. The near-identical test and cross-validated RMSEs indicate stable performance. Random Forests are robust, but in this context they do not capture the data's underlying structure as effectively as the penalized regression models.
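The models compared so far can be summarized in a single table by collecting the RMSEs computed above. A sketch assuming the `rmse_*` objects from the preceding chunks are still in scope:

```r
# Cross-validated (or CV-tuned) RMSEs from the chunks above, sorted best-first
model_comparison <- data.frame(
  Model = c("Group Lasso", "Group Ridge (grMCP)", "Group Elastic Net (grSCAD)",
            "Pruned Tree", "Random Forest"),
  RMSE = c(rmse_group_lasso, rmse_group_ridge, rmse_group_enet,
           rmse_pruned_tree, rmse_rf_cv)
)
model_comparison[order(model_comparison$RMSE), ]
```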

GBM

set.seed(101)  # for reproducible CV fold assignment, as in earlier sections
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data, 
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3,
                 n.minobsinnode = 10,
                 shrinkage = 0.01,
                 cv.folds = 5,
                 verbose = TRUE)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.4860             nan     0.0100    0.0031
##      2        1.4799             nan     0.0100    0.0042
##      3        1.4751             nan     0.0100    0.0037
##      4        1.4720             nan     0.0100    0.0007
##      5        1.4668             nan     0.0100    0.0043
##      6        1.4622             nan     0.0100    0.0017
##      7        1.4565             nan     0.0100    0.0049
##      8        1.4519             nan     0.0100    0.0038
##      9        1.4470             nan     0.0100    0.0034
##     10        1.4429             nan     0.0100    0.0026
##    100        1.1425             nan     0.0100    0.0008
##    200        0.9397             nan     0.0100    0.0008
##    400        0.7042             nan     0.0100    0.0003
##    600        0.5612             nan     0.0100   -0.0001
##    800        0.4645             nan     0.0100   -0.0001
##   1000        0.3943             nan     0.0100   -0.0001
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")

predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.9537
gbm_model <- train(
  x_train, y_train,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10,
  verbose = FALSE
)

gbm_predictions_cv <- predict(gbm_model, x_test)
mse_gbm_cv <- mean((y_test - gbm_predictions_cv)^2)
rmse_gbm_cv <- sqrt(mse_gbm_cv)

cat("10-Fold CV GBM RMSE:", rmse_gbm_cv, "\n")
## 10-Fold CV GBM RMSE: 0.9526345

GBM performs well, with RMSE values close to those of the penalized regression models, particularly Group Ridge and Group Elastic Net. Its iterative boosting process, in which each new tree is fit to the residual errors of the current ensemble, improves predictive accuracy, but it still falls slightly short of Group Lasso.
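The boosting update that `gbm` performs for a Gaussian loss can be illustrated with a minimal sketch: each shallow tree is fit to the current residuals (the negative gradient of squared error) and added with a small step size, mirroring the `shrinkage = 0.01` used above. This is a simplified illustration on toy data, not the package's actual implementation (which also subsamples and tunes node sizes).

```r
library(rpart)

# minimal gradient-boosting sketch for squared-error loss
boost_sketch <- function(x, y, n_trees = 200, shrinkage = 0.01) {
  pred <- rep(mean(y), length(y))            # start from the constant mean
  for (m in seq_len(n_trees)) {
    res <- y - pred                          # residuals = negative gradient
    fit <- rpart(res ~ x,
                 data = data.frame(x = x, res = res),
                 control = rpart.control(maxdepth = 3, cp = 0))
    pred <- pred + shrinkage * predict(fit)  # damped additive update
  }
  pred
}

# toy check: the boosted fit should beat the constant-mean baseline
set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)
p <- boost_sketch(x, y)
mean((y - p)^2) < var(y)   # should be TRUE
```

The small step size is why deviance falls slowly but steadily in the trace above: each tree corrects only one percent of the remaining error, which trades training speed for resistance to overfitting.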

Model Comparison

Group Lasso achieves the lowest RMSE (0.875 on the full variable set), indicating the best predictive performance together with effective variable selection. Group Ridge and Group Elastic Net (both 0.885) and GBM (0.953) perform nearly as well, offering a balance between prediction accuracy and interpretability. Decision Trees and Random Forests are less effective for this particular dataset and problem.

Overall Comparison of RMSE

results <- data.frame(
  Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
  Data_Set = rep(c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"), 6),
  RMSE = c(1.152, 1.139, 1.041, 1.131, 0.875,
           1.149, 1.129, 1.035, 1.033, 0.885,
           1.152, 1.139, 1.035, 1.032, 0.885,
           1.155, 1.149, 1.090, 1.090, 1.128,
           1.155, 1.129, 1.032, 1.026, 1.005,
           1.150, 1.136, 1.040, 1.048, 0.953)
)

results$Data_Set <- factor(results$Data_Set, levels = c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"))

rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(rmse_plot)

# filter results for Lasso model
lasso_results <- subset(results, Model == "Lasso")

rmse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Lasso Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_lasso_plot)

# filter results for ridge model
ridge_results <- subset(results, Model == "Ridge")

rmse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Ridge Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_ridge_plot)

# filter results for elastic net model
enet_results <- subset(results, Model == "Elastic Net")

rmse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Elastic Net Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_enet_plot)

# filter results for decision tree model
tree_results <- subset(results, Model == "Decision Tree")

rmse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Decision Tree Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_tree_plot)

# filter results for random forest model
rf_results <- subset(results, Model == "Random Forest")

rmse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Random Forest Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_rf_plot)

# filter results for GBM model
gbm_results <- subset(results, Model == "GBM")

rmse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "GBM Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(rmse_gbm_plot)
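The six per-model plots above repeat the same pattern with only the subset and title changing. As a sketch, assuming the `results` data frame and `ggplot2` already loaded above, a single helper could generate all of them and avoid copy-paste slips:

```r
library(ggplot2)

# draw the RMSE comparison plot for any one model
plot_model_rmse <- function(model_name) {
  model_results <- subset(results, Model == model_name)
  ggplot(model_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = paste(model_name, "Model RMSE Comparison"),
         x = "Data Set", y = "Root Mean Squared Error") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

for (m in unique(results$Model)) {
  print(plot_model_rmse(m))
}
```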

Conclusion

Overall Findings

Group Lasso, Ridge, and Elastic Net Models:

  • Consistently performed well across variable sets, with the lowest RMSEs observed in the full model combining diet, chemical exposure, metabolomic, and covariate data.

  • The addition of metabolomic data substantially improved model performance, suggesting the metabolome's strong predictive power in relation to BMI.

Decision Trees:

  • Showed moderate performance, with pruning improving RMSE values.

  • The full decision tree models were less effective compared to regularized regression models.

Random Forest and GBM:

  • Demonstrated robust performance, with GBM slightly outperforming Random Forest.

  • The inclusion of diverse data types (diet, chemical, metabolomic) led to improved RMSE values, highlighting the benefit of a comprehensive data approach.

Overall, a combined approach leveraging demographic, dietary, chemical, and metabolomic data provides the most accurate predictions for BMI, with regularized regression techniques (Lasso, Ridge, Elastic Net) and ensemble methods (Random Forest, GBM) offering the best performance.

Limitations

Future